Abstract

We develop a novel design framework for dynamic distributed spectrum sharing among secondary 
users (SUs), who adjust their power levels to compete for spectrum opportunities while satisfying the 
interference temperature (IT) constraints imposed by primary users. The considered interaction among 
the SUs is characterized by the following three unique features. First, the SUs are interacting with each 
other repeatedly and they can coexist in the system for a long time. Second, the SUs have limited 
and imperfect monitoring ability: they only observe whether the IT constraints are violated, and their 
observation is imperfect due to the erroneous measurements. Third, since the SUs are decentralized, 
they are selfish and aim to maximize their own long-term payoffs from utilizing the network rather 
than obeying the prescribed allocation of a centralized controller. To capture these unique features, we 
model the interaction of the SUs as a repeated game with imperfect monitoring. We first characterize 
the set of Pareto optimal operating points that can be achieved by deviation-proof spectrum sharing 
policies, which are policies that the selfish users find it in their interest to comply with. Next, for 
any given operating point in this set, we show how to construct a deviation-proof policy to achieve 
it. The constructed deviation-proof policy is amenable to distributed implementation, and allows users 
to transmit in a time-division multiple-access (TDMA) fashion. In the presence of strong multi-user 
interference, our policy outperforms existing spectrum sharing policies that dictate users to transmit at 
constant power levels simultaneously. Moreover, our policy can achieve Pareto optimality even when the 
SUs have limited and imperfect monitoring ability, as opposed to existing solutions based on repeated 
game models, which require perfect monitoring abilities. Simulation results validate our analytical results 
and quantify the performance gains enabled by the proposed spectrum sharing policies. 
I. Introduction

Cognitive radios have increased in popularity in recent years, because they have the potential to 
significantly improve the spectrum efficiency. Specifically, cognitive radios enable the secondary 
users (SUs), who initially have no rights to use the spectrum, to share the spectrum with primary 
users (PUs), who are licensed to use the spectrum, as long as the PUs' quality of service 
(QoS), such as the throughput, is not affected by the SUs [1]. A common approach to guarantee
PUs' QoS requirements is to impose interference temperature (IT) constraints [1][2][3][5]-[13]:
that is, the SUs cannot generate an interference level higher than the interference temperature 
limit set by the PUs. One of the major challenges in designing cognitive radio systems is to 
construct a spectrum sharing policy that achieves high spectrum efficiency while maintaining the 
IT constraints set by PUs. 

The spectrum sharing policy, which specifies the SUs' transmit power levels, is essential to 
improve spectrum efficiency and protect the PUs' QoS. Since SUs can use the spectrum as long 
as they do not degrade the PUs' QoS, they can use the spectrum and coexist in the system 
for long periods of time. In general, the optimal spectrum sharing policy should allow SUs to 
transmit at different power levels temporally even when the environment (e.g. the number of SUs, 
the channel gains) remains unchanged. However, most existing spectrum sharing policies require 
the SUs to transmit at constant power levels over the time horizon in which they interact¹ [2]-[14].
These policies with constant power levels are inefficient in many spectrum sharing scenarios
where the interference among the SUs is strong. Under strong multi-user interference, increasing 
one user's power level significantly degrades the other users' QoS. Hence, when the cross channel 
gains are large, the feasible QoS region is nonconvex [20]. In this case of nonconvex feasible
QoS region, a spectrum sharing policy with constant power levels is inferior to a policy with 
time-varying power levels in which the users transmit in a time-division multiple-access (TDMA) 
fashion, because the latter can achieve the Pareto boundary of the convex hull of the nonconvex 
feasible QoS region. 

Another important feature neglected in the design of spectrum sharing policies in recent works
[2]-[14] is the selfishness of SUs, who aim to maximize their own QoS and may deviate from
the prescribed spectrum sharing policy if doing so improves their QoS. Hence, the
spectrum sharing policy should be deviation-proof, which means that selfish SUs cannot improve
their QoS by deviating from the policy. In this way, selfish SUs will find it in their self-interest
to follow the policy.

1 Although some spectrum sharing policies go through a transient period of adjusting the power levels before converging
to the optimal power levels, the users maintain constant power levels after convergence.

Given the fact that the SUs will interact with each other repeatedly when sharing the spectrum, 
we model the interaction among the SUs as a repeated game. In a repeated game, the stage 
game is played repeatedly, and a user's payoff in the repeated game is the discounted average 
of the stage-game payoffs (i.e. QoS in the stage games). Users can choose different actions (i.e. 
power levels) in different stage games, and the repeated-game payoff is a convex combination 
of different stage-game payoffs. A repeated-game strategy prescribes what action to take given 
past observations, and therefore, can be considered as a spectrum sharing policy. If a repeated 
game strategy constitutes an equilibrium, then no user can gain from deviation at any occasion. 
Hence, an equilibrium strategy is a deviation-proof spectrum sharing policy. 

Spectrum sharing policies in a repeated game framework were studied in [15]-[18] under
the assumption of perfect monitoring, namely the assumption that each SU can perfectly monitor
the individual transmit power levels of all the other SUs. In the policies in [15]-[18], when a
deviation from the prescribed policy by any user is detected, a perpetual punishment phase [15]
or a punishment phase of certain duration [16][18] is triggered. In the punishment phase,
all the users transmit at the maximum power levels to create strong interference to each other,
resulting in low QoS for all the users as a punishment. Due to the threat of this punishment, all
the users will follow the policy out of self-interest. However, since the monitoring can never
be perfect, the punishment phase, in which all the users receive low throughput, will be triggered
even if no one deviates. Thus, the users' repeated-game payoffs, averaged over all the stage-game
payoffs, cannot be Pareto optimal because of the low payoffs received in the punishment phases.
Hence, the policies in [15]-[18] must suffer performance loss in practice, where the monitoring
is always imperfect.

Repeated games with imperfect monitoring have been studied extensively in the game theory
literature. In [19], it is shown that for a general repeated game with imperfect monitoring,
Pareto optimal operating points can be asymptotically achieved if certain sufficient conditions
are satisfied. One sufficient condition requires the users to be able to statistically distinguish
sufficiently many different actions. Translated to the spectrum sharing scenario, it requires the
SUs to be able to distinguish a certain number of interference temperature levels, where the
number of distinguishable IT levels grows linearly with the number of power levels each user
can choose from. This requirement indicates the need for a large amount of feedback information
on IT levels. Moreover, another sufficient condition requires the users to be sufficiently patient,
namely they discount future payoffs arbitrarily little (i.e., their discount factors are arbitrarily
close to one). This requirement on the users' patience limits the scenarios to which the policy
in [19] can be applied.

In this paper, we design deviation-proof spectrum sharing policies with time-varying power
levels to achieve Pareto optimal operating points that are not achievable by existing policies
with constant power levels [2]-[14]. We provide a systematic design approach, which first
characterizes the set of Pareto optimal operating points achievable by deviation-proof policies,
and then, for any operating point in this set, constructs a deviation-proof policy to achieve it. The
proposed policy can be easily implemented in a distributed manner. Moreover, we prove that the
proposed policy can achieve Pareto optimal operating points even when the SUs are impatient
(namely they discount future payoffs, and their discount factors are strictly smaller than one)
and have limited and imperfect monitoring ability. Specifically, their monitoring ability can be
limited in that they only need to distinguish two IT levels regardless of the number of power
levels each user can choose from, and their monitoring can be imperfect due to the erroneous
measurements of the interference temperature.² This requirement on the users' monitoring ability
is significantly relaxed compared to existing works based on repeated games, which require either
perfect monitoring of all the users' individual transmit power levels [15]-[18] or sufficiently good
monitoring to distinguish sufficiently many IT levels [19].

We illustrate the performance gain of the proposed policies over the existing policies in Fig. 1.
We show the best operating points achievable by different classes of policies in a spectrum
sharing system with two SUs. Due to the strong multi-user interference, the best operating
points achievable by policies with constant power levels [2]-[14] (the dashed curve) are Pareto
dominated by the best operating points achieved by policies with time-varying power levels (the
straight line). The proposed policy, which is deviation-proof, can achieve a portion of the Pareto
optimal operating points (the thick line). Under imperfect monitoring, the policies designed under
the assumption of perfect monitoring [15]-[18] (the solid curve) have large performance loss
compared to the proposed policy.

2 As will be described later in this paper, there is an entity that regulates the interference temperature in the system, which
measures the interference temperature imperfectly and feeds back to the users a binary signal indicating whether the constraints
are violated.

Fig. 1. An illustration of the best operating points achievable by different policies in a two-SU spectrum sharing system.

TABLE I
Comparison With Related Works In Dynamic Spectrum Sharing.

Finally, we summarize the comparison of our work with the existing works in dynamic 
spectrum sharing in Table I. We distinguish our work from existing works in the following
categories: the power levels prescribed by the spectrum sharing policy are constant or time- 
varying, whether the policy can be implemented in a distributed fashion or not, whether the 
policy is deviation-proof or not, and what are the requirements on the SUs' monitoring ability. 



August 10, 2012 DRAFT



The "monitoring" category is only discussed within the works based on repeated games. 

The rest of the paper is organized as follows. In Section II, we describe the system model
for dynamic spectrum sharing. Then, in Section III, we formulate the policy design problem
using repeated games. We solve the policy design problem in Section IV. Simulation results are
presented in Section V. Finally, Section VI concludes the paper.



II. System Model For Dynamic Spectrum Sharing 

We consider a system with one primary user³ and $N$ secondary users (see Fig. 2 for an
illustrative example of a system with two secondary users). The set of SUs is denoted by
$\mathcal{N} = \{1, 2, \ldots, N\}$. Each SU has a transmitter and a receiver. The channel gain from SU $i$'s
transmitter to SU $j$'s receiver is $g_{ij}$. Each SU $i$ chooses a power level $p_i$ from a finite set $\mathcal{P}_i$. In
other words, each SU chooses from discrete power levels. We assume that $0 \in \mathcal{P}_i$, namely SU $i$
can choose not to transmit. We define SU $i$'s maximum transmit power as $p_i^{\max} = \max_{p_i \in \mathcal{P}_i} p_i$.
The set of joint power profiles is denoted by $\mathcal{P} = \prod_{i \in \mathcal{N}} \mathcal{P}_i$, and the joint power profile of all the
SUs is denoted by $\mathbf{p} = (p_1, \ldots, p_N) \in \mathcal{P}$. Let $\mathbf{p}_{-i}$ be the power profile of all the SUs other than
SU $i$. Each SU $i$'s instantaneous payoff (QoS) is a function of the joint power profile, namely
$u_i : \mathcal{P} \to \mathbb{R}_+$. Each SU $i$'s payoff $u_i(\mathbf{p})$ is decreasing in the other SUs' power levels $p_j$, $\forall j \neq i$.
Note that we do not assume that $u_i(\mathbf{p})$ is increasing in $p_i$.⁴ But we do assume that $u_i(\mathbf{p}) = 0$ if
$p_i = 0$, because an SU's payoff should be zero when it does not transmit. One example of many
possible payoff functions is the SU's throughput:

$$u_i(\mathbf{p}) = \log_2\left(1 + \frac{p_i g_{ii}}{\sum_{j \neq i} p_j g_{ji} + n_i}\right), \tag{1}$$

where $n_i$ is the noise power at SU $i$'s receiver.
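As a rough illustration of the throughput payoff in (1), the following sketch evaluates it for a two-SU system. The channel gains, noise powers, and power levels are made-up values chosen only to show the effect of strong cross gains; the function name `throughput` is ours, not the paper's.

```python
import math

def throughput(p, g, noise):
    """Throughput of every SU for a joint power profile, following Eq. (1).

    p[i]     : transmit power of SU i
    g[j][i]  : channel gain from SU j's transmitter to SU i's receiver
    noise[i] : noise power at SU i's receiver
    """
    n = len(p)
    rates = []
    for i in range(n):
        # Multi-user interference seen at SU i's receiver.
        interference = sum(p[j] * g[j][i] for j in range(n) if j != i)
        sinr = p[i] * g[i][i] / (interference + noise[i])
        rates.append(math.log2(1.0 + sinr))
    return rates

# Hypothetical two-SU example with large cross channel gains.
g = [[1.0, 0.8], [0.8, 1.0]]
noise = [1.0, 1.0]
both_on = throughput([3.0, 3.0], g, noise)   # simultaneous transmission
only_su1 = throughput([3.0, 0.0], g, noise)  # SU 1 transmits alone
```

With these illustrative numbers, each SU gets roughly 0.91 bits/s/Hz when both transmit, whereas an SU transmitting alone gets 2, so sharing time equally (1 each on average) already beats simultaneous transmission.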

As in [8]-[11], there is a local spectrum server (LSS) serving as a mediating entity among
the SUs. The LSS has a receiver to measure the interference temperature and a transmitter to
broadcast signals, but it cannot control the actions of the autonomous SUs. The LSS could be
a device deployed by the PU or simply the PU itself, if the PU manages by itself the spectrum
leased to the SUs. Even when the PU is the LSS, it is beneficial to consider the LSS as a separate
logical entity that performs the functionality of spectrum management. The LSS could also be
a device deployed by a regulatory agency such as the Federal Communications Commission
(FCC), which uses it for spectrum management in that local geographic area. In both cases, the
LSS aims to improve the spectrum efficiency (e.g. the sum throughput of all the SUs) and the
fairness, while ensuring that the IT limit set by the PU is not violated. Note that the PU may
also want to maximize the spectrum efficiency to maximize its revenue obtained from spectrum
leasing, since its revenue may be proportional to the sum throughput of the SUs.

3 Although we study a system with one PU as in [2][3][5]-[7][12], our model and design framework can be easily extended
to the scenario of multiple PUs located in different geographic regions.

4 In some scenarios with energy efficiency considerations, the payoff is defined as the ratio of throughput to transmit power,
which may not monotonically increase with the transmit power.

Fig. 2. An example system model with two secondary users. The solid lines represent links for data transmission, and the
dashed lines indicate links for control signals. The channel gains for the corresponding data links are written in the figure.
The primary user (PU) specifies the interference temperature (IT) limit to the local spectrum server (LSS). The LSS sets the
intermediate IT limit for the secondary users and sends distress signals if the estimated interference power exceeds the IT limit.

The LSS measures the interference temperature at its receiver imperfectly. The measurement
can be written as $\sum_{i \in \mathcal{N}} p_i g_{i0} + \varepsilon$, where $g_{i0}$ is the channel gain from SU $i$'s transmitter to the
LSS's receiver, and $\varepsilon$ is the additive measurement error. We assume that the measurement error
has zero mean and a probability distribution function $f_\varepsilon$ known to the LSS. We assume, as in
most existing works (e.g. [2]-[12]), that the IT limit $\bar{I}$ set by the PU is known perfectly by the
LSS. Although the LSS aims to keep the interference temperature below the IT limit $\bar{I}$, it will set
a lower intermediate IT limit $I \leq \bar{I}$ to be conservative because of measurement errors. Hence,
the IT constraint imposed by the LSS is

$$\sum_{i \in \mathcal{N}} p_i g_{i0} \leq I. \tag{2}$$
Even if the actual interference temperature $\sum_{i \in \mathcal{N}} p_i g_{i0}$ does not exceed the intermediate IT limit
$I$, the erroneous measurement $\sum_{i \in \mathcal{N}} p_i g_{i0} + \varepsilon$ may still exceed the IT limit $\bar{I}$ set by the PU. In
this case, the LSS will broadcast a distress signal to all the SUs. Given the joint power profile
$\mathbf{p}$, this false alarm probability is

$$\Gamma(\mathbf{p}) = \Pr\left(\sum_{i \in \mathcal{N}} p_i g_{i0} + \varepsilon > \bar{I} \;\middle|\; \sum_{i \in \mathcal{N}} p_i g_{i0} \leq I\right), \tag{3}$$

where $\Pr(A)$ is the probability that the event $A$ happens. We can see that a larger intermediate
IT limit $I$ enables the SUs to transmit at higher power levels, but results in a larger false alarm
probability and a higher frequency of sending distress signals. Hence, there is an interesting
tradeoff between the spectrum efficiency and the cost of sending distress signals.
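To make the tradeoff concrete, the sketch below evaluates the false alarm probability for a given power profile. It assumes, purely for illustration, that the measurement error is zero-mean Gaussian with standard deviation `sigma`; the paper only requires a zero-mean error with a distribution known to the LSS. The function and parameter names are ours.

```python
import math

def prob_distress(p, g0, it_limit_pu, sigma):
    """Probability that the noisy measurement exceeds the PU's IT limit.

    Assumption (not from the paper): zero-mean Gaussian measurement
    error with standard deviation sigma.
    """
    interference = sum(pi * gi for pi, gi in zip(p, g0))
    z = (it_limit_pu - interference) / sigma
    # Pr(error > limit - interference) = Q(z) = 0.5 * erfc(z / sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def false_alarm(p, g0, it_limit_pu, it_limit_mid, sigma):
    """False alarm probability of Eq. (3); defined only for profiles
    that satisfy the intermediate IT constraint of Eq. (2)."""
    interference = sum(pi * gi for pi, gi in zip(p, g0))
    if interference > it_limit_mid:
        raise ValueError("power profile violates the intermediate IT constraint")
    return prob_distress(p, g0, it_limit_pu, sigma)

# Hypothetical numbers: PU limit 2.0, intermediate limit 1.0, unit error variance.
gamma = false_alarm([1.0, 0.0], [1.0, 1.0], 2.0, 1.0, 1.0)
```

Raising the intermediate limit toward the PU's limit admits higher-power profiles, for which `prob_distress` grows toward 0.5 under this Gaussian assumption.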

An SU's payoff is affected by the multi-user interference $\sum_{j \neq i} p_j g_{ji}$, which depends
on the cross channel gains among different SUs. When the multi-user interference is weak due
to small cross channel gains, power control becomes less important, since one SU's power level
hardly affects the others' payoffs. Hence, in this paper, we focus on the more interesting
scenario in which the multi-user interference is strong and power control is essential for efficient
interference management. We quantify the strength of multi-user interference as follows. First,
we write $\tilde{\mathbf{p}}^i = (\tilde{p}^i_1, \ldots, \tilde{p}^i_N)$ as the joint power profile that maximizes SU $i$'s payoff subject to
the IT constraint, namely

$$\tilde{\mathbf{p}}^i = \arg\max_{\mathbf{p} \in \mathcal{P}} u_i(\mathbf{p}), \quad \text{subject to } \sum_{j \in \mathcal{N}} p_j g_{j0} \leq I. \tag{4}$$

Since $u_i$ is decreasing in $p_j$, $\forall j \neq i$, we have $\tilde{p}^i_j = 0$, $\forall j \neq i$. For notational simplicity, we
define the maximum payoff achievable by SU $i$ as $\bar{v}_i = u_i(\tilde{\mathbf{p}}^i)$. Then, we say a spectrum sharing
scenario has strong multi-user interference if the following property is satisfied.

Definition 1 (Strong Multi-user Interference): A spectrum sharing scenario has strong multi-user
interference if the set of feasible payoffs $\mathcal{V} = \mathrm{conv}\{\mathbf{u}(\mathbf{p}) = (u_1(\mathbf{p}), \ldots, u_N(\mathbf{p})) : \mathbf{p} \in \mathcal{P}, \sum_{i \in \mathcal{N}} p_i g_{i0} \leq I\}$,
where $\mathrm{conv}(X)$ is the convex hull of $X$, has $N + 1$ extremal points⁵:
$(0, \ldots, 0) \in \mathbb{R}^N$, $\mathbf{u}(\tilde{\mathbf{p}}^1), \ldots, \mathbf{u}(\tilde{\mathbf{p}}^N)$.

This definition characterizes the strong interference among the SUs: the increase of one SU's
payoff comes at such an expense of the other SUs' payoffs that the set of feasible payoffs
without time sharing is nonconvex. A spectrum sharing scenario satisfies this property when
the cross channel gains among users are large [20]. In the extreme case of strong multi-user
interference, simultaneous transmissions from different SUs result in packet loss, as captured in
the collision model [21]. According to this definition, the set of feasible payoffs can be written
as $\mathcal{V} = \mathrm{conv}\{(0, \ldots, 0), \mathbf{u}(\tilde{\mathbf{p}}^1), \ldots, \mathbf{u}(\tilde{\mathbf{p}}^N)\}$. Moreover, its Pareto boundary is
$\mathcal{B} = \{\mathbf{v} \in \mathcal{V} : \sum_{i=1}^{N} v_i/\bar{v}_i = 1,\ v_i \geq 0,\ \forall i\}$, which is part of a hyperplane and can be achieved only by SUs
transmitting in a TDMA fashion.

5 The extremal points of a convex set are those that are not convex combinations of other points in the set.
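The payoff-maximizing profile of each SU and its maximum payoff can be found by brute-force enumeration when the power sets are small. The sketch below does this for the throughput payoff; the gains, noise powers, power sets, and intermediate IT limit are illustrative assumptions, and the helper names are ours.

```python
import math
from itertools import product

def throughput(i, p, g, noise):
    """Throughput of SU i under joint power profile p (Eq. (1))."""
    interference = sum(p[j] * g[j][i] for j in range(len(p)) if j != i)
    return math.log2(1.0 + p[i] * g[i][i] / (interference + noise[i]))

def best_profile(i, power_sets, g, g0, noise, it_limit_mid):
    """Enumerate all joint profiles that satisfy the intermediate IT
    constraint and return the one that maximizes SU i's payoff."""
    best_p, best_u = None, -1.0
    for p in product(*power_sets):
        if sum(pk * gk for pk, gk in zip(p, g0)) > it_limit_mid:
            continue  # violates the IT constraint
        val = throughput(i, p, g, noise)
        if val > best_u:
            best_p, best_u = p, val
    return best_p, best_u

# Hypothetical two-SU system with strong cross gains.
power_sets = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
g = [[1.0, 0.8], [0.8, 1.0]]
g0 = [1.0, 1.0]
noise = [1.0, 1.0]
p1, v1_bar = best_profile(0, power_sets, g, g0, noise, 3.0)
p2, v2_bar = best_profile(1, power_sets, g, g0, noise, 3.0)
```

In this example the maximizing profiles have the other SU silent, as the text argues, and one can check numerically that no feasible simultaneous profile has normalized payoffs summing to more than one.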

III. Formulation of The Policy Design Problem 

In this section, we first formulate the interaction among the SUs as a repeated game with 
imperfect monitoring, and define the deviation-proof spectrum sharing policy. Then, we formally 
define the policy design problem and outline our design framework to solve it. 

A. Formulation of The Repeated Game 

Similar to [2]-[14], we assume that the system parameters, such as the number of SUs and
the channel gains, remain fixed during the considered time horizon. The system is time slotted at
$t = 0, 1, \ldots$. We assume that the users are synchronized as in [2]-[14]. At the beginning of time
slot $t$, each SU $i$ chooses its power level $p_i^t$, and receives a payoff $u_i(\mathbf{p}^t)$. The LSS obtains the
measurement $\sum_{i \in \mathcal{N}} p_i^t g_{i0} + \varepsilon^t$, where $\varepsilon^t$ is the realization of the error $\varepsilon$ at time slot $t$, and compares
the measurement with the IT limit $\bar{I}$. The set of outcomes of the comparison, $Y$,
has two elements, namely $Y = \{y_0, y_1\}$. The (measurement) outcome $y^t$ is determined by

$$y^t = \begin{cases} y_0, & \text{if } \sum_{i \in \mathcal{N}} p_i^t g_{i0} + \varepsilon^t > \bar{I}, \\ y_1, & \text{otherwise.} \end{cases} \tag{5}$$

We write the conditional probability distribution of the outcome $y$ given the joint power profile
$\mathbf{p}$ as $\rho(y|\mathbf{p})$, which can be calculated as

$$\rho(y_1|\mathbf{p}) = \int_{-\infty}^{\bar{I} - \sum_{i \in \mathcal{N}} p_i g_{i0}} f_\varepsilon(x)\, dx, \qquad \rho(y_0|\mathbf{p}) = 1 - \rho(y_1|\mathbf{p}). \tag{6}$$

At the end of time slot $t$, the LSS sends a distress signal if the outcome $y^t = y_0$. Note that the
LSS does not send signals when the outcome is $y_1$, and the SUs know that the outcome is $y_1$
by default when they do not receive the distress signal.
Note that in repeated games with perfect monitoring [15]-[18], the outcome available to each
SU at time slot $t$ is precisely the joint power profile chosen by the SUs, i.e. $y^t = \mathbf{p}^t$. We say
the monitoring is imperfect if $y^t \neq \mathbf{p}^t$. In a general repeated game with imperfect monitoring,
in order to achieve Pareto optimality, the set of outcomes $Y$ should have a large cardinality,
namely $|Y| \geq |\mathcal{P}_i| + |\mathcal{P}_j| - 1$ for all $i \in \mathcal{N}$ and all $j \neq i$ [19]. In contrast, our proposed policy
can achieve Pareto optimality even when $|Y| = 2$, regardless of the cardinality of the SUs' action
sets $\mathcal{P}_i$.

At each time slot $t$, each SU $i$ determines its transmit power $p_i^t$ based on its history, which is
a collection of all the past power levels it has chosen and all the past measurement outcomes.
Formally, the history of SU $i$ up to time slot $t \geq 1$ is $h_i^t = \{p_i^0, y^0; \ldots; p_i^{t-1}, y^{t-1}\} \in (\mathcal{P}_i \times Y)^t$,
and that at time slot $0$ is $h_i^0 = \emptyset$. The history of SU $i$ contains private information about SU
$i$'s power levels that is unknown to the other SUs; in contrast, we define the public history
as $h^t = \{y^0; \ldots; y^{t-1}\} \in Y^t$ for $t \geq 1$ and $h^0 = \emptyset$. The public history $h^t$ only contains the
measurement outcomes that are known to all the SUs.

In this paper, we focus on public strategies, in which each SU's decision depends on the public
history only. Hence, each SU $i$'s strategy $\sigma_i$ is a mapping from the set of all possible public
histories to its action set, namely $\sigma_i : \bigcup_{t=0}^{\infty} Y^t \to \mathcal{P}_i$. Due to the realization equivalence principle [25,
Lemma 7.1.2], we lose nothing by only considering public strategies, in terms of the achievable
Pareto optimal operating points.

The spectrum sharing policy is the joint strategy profile of all the SUs, defined as $\sigma =
(\sigma_1, \ldots, \sigma_N)$. The SUs are selfish and maximize their own long-term discounted payoffs.
Assuming, as in [15]-[19], the same discount factor $\delta \in [0, 1)$ for all the SUs, each SU $i$'s (long-term
discounted) payoff can be written as

$$U_i(\sigma) = (1-\delta)\left[u_i(\mathbf{p}^0) + \sum_{t=1}^{\infty} \delta^t \sum_{y^{t-1} \in Y} \rho(y^{t-1}|\mathbf{p}^{t-1})\, u_i(\mathbf{p}^t)\right],$$

where $\mathbf{p}^0$ is determined by $\mathbf{p}^0 = \sigma(\emptyset)$, and $\mathbf{p}^t$ for $t \geq 1$ is determined by $\mathbf{p}^t = \sigma(h^t) =
\sigma(h^{t-1}; y^{t-1})$. The discount factor represents the "patience" of the SUs; a larger discount factor
indicates that an SU is more patient. The discount factor is determined by the delay sensitivity
of the SUs' applications.
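The role of the discount factor can be seen in a simplified setting. The sketch below computes the discounted average payoff for a fixed, deterministic sequence of stage payoffs, i.e. it ignores the randomness of the monitoring outcomes and truncates the infinite sum at a finite horizon. Both simplifications, and the example numbers, are ours.

```python
def discounted_average(stage_payoffs, delta):
    """(1 - delta) * sum_t delta^t * u^t over a finite prefix of stage payoffs."""
    total = 0.0
    weight = 1.0  # delta^t, starting at t = 0
    for u in stage_payoffs:
        total += weight * u
        weight *= delta
    return (1.0 - delta) * total

# Hypothetical round-robin TDMA between two SUs, each with maximum payoff 2.0:
# SU 1 transmits in even slots, SU 2 in odd slots.
delta = 0.9
horizon = 2000  # long enough that the truncated tail is negligible
u1 = discounted_average([2.0 if t % 2 == 0 else 0.0 for t in range(horizon)], delta)
u2 = discounted_average([0.0 if t % 2 == 0 else 2.0 for t in range(horizon)], delta)
```

Because the users are impatient, the SU that transmits first ends up with the larger discounted payoff, yet the two payoffs normalized by the maximum payoff still sum to one, so the pair lies on the Pareto boundary.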

We define the deviation-proof policy as the perfect public equilibrium (PPE) of the game. The
PPE prescribes a strategy profile $\sigma$ from which no SU has an incentive to deviate after any given
history at any time slot, and thus can be considered as a deviation-proof policy. It is
stricter than Nash equilibrium, because it requires that the SUs have no incentive to deviate
at any given history, while Nash equilibrium only guarantees this at the histories that can
arise from the equilibrium strategy. We can also consider PPE in repeated games with imperfect
monitoring as the counterpart of subgame perfect equilibrium defined in repeated games with
perfect monitoring [25].

Before the definition of PPE, we introduce the concept of continuation strategy: SU $i$'s
continuation strategy induced by any history $h^t \in Y^t$, denoted $\sigma_i|_{h^t}$, is defined by $\sigma_i|_{h^t}(h^\tau) =
\sigma_i(h^t h^\tau)$, $\forall h^\tau \in Y^\tau$, where $h^t h^\tau$ is the concatenation of the history $h^t$ followed by the history
$h^\tau$. By convention, we denote by $\sigma|_{h^t}$ and $\sigma_{-i}|_{h^t}$ the continuation strategy profile induced by $h^t$
of all the SUs and that of all the SUs other than SU $i$, respectively. Then the PPE is defined as
follows [25, Definition 7.1.2].

Definition 2 (Perfect Public Equilibrium): A strategy profile $\sigma$ is a perfect public equilibrium
if for any public history $h^t \in Y^t$, the induced continuation strategy $\sigma|_{h^t}$ is a Nash equilibrium
of the continuation game, namely for all $i \in \mathcal{N}$,

$$U_i(\sigma|_{h^t}) \geq U_i(\sigma_i'|_{h^t}, \sigma_{-i}|_{h^t}), \quad \text{for all } \sigma_i'. \tag{7}$$

We define the equilibrium payoff as a vector of payoffs $\mathbf{v} = (U_1(\sigma), \ldots, U_N(\sigma))$ achieved at
the equilibrium.

B. The Policy Design Problem 

The primary user or the regulatory agency aims to maximize an objective function defined
on the SUs' payoffs, $W(U_1(\sigma), \ldots, U_N(\sigma))$. This definition of the objective function is general
enough to include the objective functions deployed in many existing works, such as
[2]-[14][15][16]. An example of the objective function is the weighted sum payoff $\sum_{i=1}^{N} w_i U_i$, where
$\{w_i\}_{i=1}^{N}$ are the weights satisfying $w_i \in [0, 1]$, $\forall i$, and $\sum_{i=1}^{N} w_i = 1$. The PU (respectively, the
regulatory agency) maximizes the objective function for the revenue (respectively, the spectrum efficiency),
while maintaining the IT constraint (2). To reduce the cost of sending distress signals, a constraint
on the false alarm probability is also imposed as $\Gamma(\mathbf{p}) \leq \bar{\Gamma}$, where $\bar{\Gamma}$ is the maximum false alarm
probability allowed. At the maximum of the welfare function, some SUs may have extremely
low payoffs. To avoid this, a minimum payoff guarantee $\gamma_i \geq 0$ is imposed for each SU $i$. To
sum up, we can formally define the policy design problem as follows:

$$\begin{aligned} \max_{\sigma} \quad & W(U_1(\sigma), \ldots, U_N(\sigma)) \\ \text{s.t.} \quad & \sigma \text{ is a perfect public equilibrium}, \\ & \Gamma(\sigma(h^t)) \leq \bar{\Gamma},\ \forall t,\ \forall h^t \in Y^t, \\ & U_i(\sigma) \geq \gamma_i,\ \forall i \in \mathcal{N}. \end{aligned} \tag{8}$$

Fig. 3. The procedure of solving the design problem. Step 1: quantify the set of Pareto optimal equilibrium payoffs. Step 2: determine the optimal equilibrium payoff. Step 3: construct the optimal spectrum sharing policy.

IV. Solving The Policy Design Problem 

In this section, we solve the policy design problem (8) following the procedure outlined in
Fig. 3. We first quantify the set of Pareto optimal equilibrium payoffs (i.e. the Pareto optimal
payoffs that can be achieved by deviation-proof policies), then determine the optimal equilibrium
payoff based on the welfare function, and finally construct the deviation-proof policy to achieve
the optimal equilibrium payoff.

A. Quantify The Set of Pareto Optimal Equilibrium Payoffs 

The first step in solving the design problem (8) is to characterize the set of Pareto optimal
equilibrium payoffs for the dynamic spectrum sharing system. In particular, we are interested in
the case when the SUs are impatient (their discount factor is strictly smaller than 1), as opposed
to the asymptotic case when the SUs are arbitrarily patient (their discount factor goes to 1) in
[15][16][19]. For repeated games with perfect monitoring, the characterization of Pareto optimal
equilibrium payoffs with impatient users is provided in [18]. Our result in Theorem 1 is the first
one that analytically quantifies the set of Pareto optimal equilibrium payoffs for repeated games
with imperfect monitoring and impatient users.

For the spectrum sharing systems with strong multi-user interference, recall from Definition 1
that the set of feasible payoffs can be written as $\mathcal{V} = \mathrm{conv}\{(0, \ldots, 0), \mathbf{u}(\tilde{\mathbf{p}}^1), \ldots, \mathbf{u}(\tilde{\mathbf{p}}^N)\}$, and
that its Pareto boundary is $\mathcal{B} = \{\mathbf{v} : \sum_{i=1}^{N} v_i/\bar{v}_i = 1,\ v_i \geq 0,\ \forall i\}$. Now we need to determine
which portion of the Pareto boundary $\mathcal{B}$ can be achieved as equilibrium payoffs (i.e. payoffs that
can be achieved by deviation-proof policies).

Before stating Theorem 1, we define the benefit from deviation as follows.

Definition 3 (Benefit From Deviation): We define SU $j$'s benefit from deviation from SU $i$'s
payoff maximizing power profile $\tilde{\mathbf{p}}^i$ as

$$b_{ij} = \max_{p_j \in \mathcal{P}_j,\ p_j \neq \tilde{p}^i_j} \frac{\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})/\bar{v}_j}. \tag{9}$$

Our definition of the benefit from deviation results from two intuitions. First, whether there is a
benefit from deviation should depend on whether the deviation can be statistically detected. A
deviation can be statistically detected only if $\rho(y_0|\tilde{\mathbf{p}}^i) < \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})$. This is because this
inequality implies that the probability of sending the distress signal is larger when the power
profile is $(p_j, \tilde{\mathbf{p}}^i_{-j})$, in which SU $j$ deviates from $\tilde{p}^i_j$ to $p_j$, than the corresponding probability
when the power profile is $\tilde{\mathbf{p}}^i$, in which SU $j$ does not deviate. Hence, it is statistically correct
for the SUs to associate the receipt of the distress signal $y_0$ with the event of deviation. Since
$u_j(p_j, \tilde{\mathbf{p}}^i_{-j})/\bar{v}_j$ is always larger than 0, the benefit from deviation is negative if and only if
$\rho(y_0|\tilde{\mathbf{p}}^i) < \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})$ for every deviation $p_j$. In other words, there is no benefit but only cost from deviation if the
deviation can be statistically identified by the distress signal.

Second, the benefit from deviation depends on how likely the deviation is to be detected (reflected
by $|\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})|$), as well as how much an SU can gain from deviation (reflected
by $u_j(p_j, \tilde{\mathbf{p}}^i_{-j})/\bar{v}_j$). Since $b_{ij} < 0$, its absolute value can be considered as the cost from
deviation. The cost from deviation $|b_{ij}|$ increases with $|\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})|$, the likelihood
that a deviation is detected. In addition, $|b_{ij}|$ decreases with $u_j(p_j, \tilde{\mathbf{p}}^i_{-j})/\bar{v}_j$, the payoff SU $j$
obtains from deviation normalized by its maximum payoff.
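A small numerical sketch may help to see how the benefit from deviation behaves. It assumes a zero-mean Gaussian measurement error and the throughput payoff, neither of which the definition requires, and uses made-up gains and limits; all function names are ours.

```python
import math

def rho_y0(p, g0, it_limit_pu, sigma):
    """Distress-signal probability, assuming zero-mean Gaussian error."""
    z = (it_limit_pu - sum(a * b for a, b in zip(p, g0))) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def throughput(j, p, g, noise):
    """Throughput of SU j under joint power profile p."""
    interference = sum(p[k] * g[k][j] for k in range(len(p)) if k != j)
    return math.log2(1.0 + p[j] * g[j][j] / (interference + noise[j]))

def benefit_from_deviation(j, p_tilde_i, power_set_j, v_bar_j,
                           g, g0, noise, it_limit_pu, sigma):
    """Benefit of SU j from deviating when SU i's payoff-maximizing
    profile p_tilde_i is prescribed (the ratio in Definition 3)."""
    base = rho_y0(p_tilde_i, g0, it_limit_pu, sigma)
    ratios = []
    for pj in power_set_j:
        if pj == p_tilde_i[j]:
            continue  # not a deviation
        deviated = list(p_tilde_i)
        deviated[j] = pj
        gain = throughput(j, deviated, g, noise) / v_bar_j
        ratios.append((base - rho_y0(deviated, g0, it_limit_pu, sigma)) / gain)
    return max(ratios)

# Hypothetical system: SU 1 is prescribed to transmit alone at power 3;
# index 1 below refers to SU 2 (zero-based indexing).
g = [[1.0, 0.8], [0.8, 1.0]]
g0 = [1.0, 1.0]
noise = [1.0, 1.0]
b_12 = benefit_from_deviation(1, [3.0, 0.0], [0.0, 1.0, 2.0, 3.0], 2.0,
                              g, g0, noise, 4.0, 1.0)
```

In this example every deviation by SU 2 raises the interference at the LSS and hence the distress probability, so the computed benefit is negative, which is the situation required by Condition 1 of Theorem 1.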
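Now we state Theorem 1, which analytically quantifies the set of Pareto optimal equilibrium
payoffs.

Theorem 1: We can achieve the following set of Pareto optimal equilibrium payoffs

$$\mathcal{B}_\mu = \left\{\mathbf{v} : \sum_{i=1}^{N} \frac{v_i}{\bar{v}_i} = 1,\ \frac{v_i}{\bar{v}_i} \geq \mu_i,\ \forall i\right\}, \tag{10}$$

where $\mu_i = \max_{j \neq i} \frac{1 - \rho(y_0|\tilde{\mathbf{p}}^j)}{-b_{ji}}$, if and only if, first, the following two sets of conditions are satisfied
for all $i \in \mathcal{N}$ and for all $j \neq i$:

• Condition 1: benefit from deviation $b_{ij} < 0$;

• Condition 2: no incentive for SU $i$ to deviate:

$$1 - \frac{u_i(p_i, \tilde{\mathbf{p}}^i_{-i})}{\bar{v}_i} + \sum_{j \neq i} \frac{\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_i, \tilde{\mathbf{p}}^i_{-i})}{-b_{ij}} \geq 0, \quad \forall p_i \in \mathcal{P}_i;$$

and second, the discount factor $\delta$ is larger than a threshold:

$$\delta \geq \underline{\delta} = \frac{1}{1 + \dfrac{1 - \sum_{i \in \mathcal{N}} \mu_i}{N - 1 + \sum_{i \in \mathcal{N}} \sum_{j \neq i} \left(-\rho(y_0|\tilde{\mathbf{p}}^i)/b_{ij}\right)}}. \tag{11}$$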

Proof: We provide an outline of the proof here. Please refer to Appendix A for the complete
proof.

The proof relies heavily on the concept of self-generating sets [26]. Simply put, a
self-generating set, associated with a discount factor, is a set in which every payoff is a PPE payoff
under the associated discount factor [26]. Any self-generating set is associated with a minimum
discount factor; any discount factor larger than the minimum one can be associated with that
self-generating set. The main contribution of the proof is to find the largest self-generating set
and the associated minimum discount factor. Since we focus on the Pareto optimal equilibrium
payoffs, we restrict our attention to the self-generating sets on the Pareto boundary. This restriction allows us
to obtain the analytical expression of the largest self-generating set $\mathcal{B}_\mu$. Meanwhile, the sufficient
and necessary conditions for $\mathcal{B}_\mu$ to be self-generating are obtained. ■

Theorem 1 provides the sufficient and necessary conditions for the existence of Pareto optimal
equilibrium payoffs. Condition 1 (respectively, Condition 2) ensures that at the power profile $\tilde{\mathbf{p}}^i$,
SU $j$ for any $j \neq i$ (respectively, SU $i$) has no incentive to deviate. When the conditions are
satisfied, Theorem 1 quantifies the set of Pareto optimal equilibrium payoffs $\mathcal{B}_\mu$. We can choose
any payoff in $\mathcal{B}_\mu$ as the deviation-proof operating point. Theorem 1 also gives us the minimum
discount factor under which any payoff in $\mathcal{B}_\mu$ is achievable. We can thus determine the maximum
level of impatience the users can have in order to achieve any payoff in $\mathcal{B}_\mu$.

Remark 1: Note that we have not assumed the monotonicity of the payoff function $u_i$. If each
SU's payoff function increases with its own transmit power, then Condition 2 in Theorem 1 holds
as long as Condition 1 is satisfied.

Remark 2: Note also that the set of Pareto optimal equilibrium payoffs $\mathcal{B}_{\underline{\mu}}$ could be empty if
$\underline{\mu}_i$ is large. More precisely, $\mathcal{B}_{\underline{\mu}}$ is nonempty if and only if $\sum_{i \in \mathcal{N}} \underline{\mu}_i \le 1$.
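As an illustration of how the quantities in Theorem 1 and Remark 2 could be evaluated numerically, the following minimal Python sketch computes the lower bounds $\underline{\mu}_j = \max_{i \ne j} (1 - \rho(y_0|\tilde{\mathbf{p}}^i))/(-b_{ij})$ and a discount factor threshold of the form $1/\bigl(1 + (1 - \sum_i \underline{\mu}_i)/(N - 1 + \sum_i \sum_{j \ne i} (-\rho(y_0|\tilde{\mathbf{p}}^i)/b_{ij}))\bigr)$. The function name, the argument layout, and the numbers in the usage note are assumptions made for illustration only; Condition 2 is not checked because it needs the payoff functions themselves.

```python
def pareto_equilibrium_bounds(rho0, b):
    """Sketch: lower bounds mu_j and discount factor threshold (assumed reading of Theorem 1).

    rho0[i] : probability of the distress signal y0 when SU i is the active user
    b[i][j] : benefit from deviation b_ij of SU j when SU i is active (diagonal unused)
    Returns (mu, delta_min), or None if Condition 1 fails or the payoff set is empty.
    """
    n = len(rho0)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    # Condition 1: every benefit from deviation must be strictly negative.
    if any(b[i][j] >= 0 for i, j in pairs):
        return None
    # mu_j = max over i != j of (1 - rho(y0 | active user i)) / (-b_ij)
    mu = [max((1.0 - rho0[i]) / (-b[i][j]) for i in range(n) if i != j)
          for j in range(n)]
    slack = 1.0 - sum(mu)
    if slack < 0:
        # Remark 2: the set is empty when the lower bounds sum to more than 1.
        return None
    denom = (n - 1) + sum(-rho0[i] / b[i][j] for i, j in pairs)
    delta_min = 1.0 / (1.0 + slack / denom)
    return mu, delta_min
```

For a hypothetical two-user case with `rho0 = [0.1, 0.1]` and `b_ij = -3` for both ordered pairs, the sketch gives lower bounds of 0.3 for each user and a threshold of 8/11, roughly 0.727.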

B. Determine The Optimal Operating Point 

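Since we have identified the set of Pareto optimal equilibrium payoffs $\mathcal{B}_{\underline{\mu}}$, the problem of finding
the optimal operating point that solves the policy design problem can be written as

$$\max_{\mathbf{v}} \ W(v_1, \ldots, v_N) \qquad (12)$$
$$\text{s.t.} \ (v_1/\bar{v}_1, \ldots, v_N/\bar{v}_N) \in \mathcal{B}_{\underline{\mu}},$$
$$\phantom{\text{s.t.}} \ v_i \ge \gamma_i, \ \forall i \in \mathcal{N}.$$

The linear constraints in the above problem can be further simplified as $v_i \ge \max\{\underline{\mu}_i \cdot \bar{v}_i, \gamma_i\}, \forall i \in \mathcal{N}$.
Hence, we get the sufficient and necessary condition under which the optimization problem
(12) is feasible:

$$\sum_{i \in \mathcal{N}} \max\{\underline{\mu}_i, \gamma_i/\bar{v}_i\} \le 1. \qquad (13)$$

The optimization problem (12) is easy to solve when $W$ is a concave function in $(v_1, \ldots, v_N)$,
since it is then a maximization of a concave function under linear constraints.
For example, if the objective function is the weighted sum of the users' payoffs, namely
$W = \sum_{i=1}^{N} w_i v_i$, the solution can be obtained analytically as
$v^*_{i^*} = \left(1 - \sum_{j \ne i^*} \max\{\underline{\mu}_j, \gamma_j/\bar{v}_j\}\right) \cdot \bar{v}_{i^*}$ for
$i^* = \arg\max_{j \in \mathcal{N}} w_j \bar{v}_j$, and $v^*_i = \max\{\underline{\mu}_i, \gamma_i/\bar{v}_i\} \cdot \bar{v}_i$ for all $i \ne i^*$.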
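The closed-form solution for the weighted-sum objective can be sketched as follows. This is only an illustration under the reading that each user first receives its normalized lower bound $\max\{\underline{\mu}_i, \gamma_i/\bar{v}_i\}$ and the remaining share goes to the user with the largest $w_j \bar{v}_j$; the function and argument names are hypothetical.

```python
def weighted_sum_operating_point(w, v_bar, mu, gamma):
    """Sketch of the analytical solution for W = sum_i w_i * v_i.

    w     : weights w_i
    v_bar : maximum payoffs of the users
    mu    : lower bounds from Theorem 1 (normalized payoffs)
    gamma : minimum payoff guarantees gamma_i (unnormalized)
    Returns the payoff vector v*, or None if the feasibility condition (13) fails.
    """
    n = len(w)
    # Normalized lower bound of each user: max{mu_i, gamma_i / v_bar_i}.
    floor = [max(mu[i], gamma[i] / v_bar[i]) for i in range(n)]
    if sum(floor) > 1.0:
        return None  # condition (13) is violated
    # The user with the largest w_j * v_bar_j receives all the remaining share.
    i_star = max(range(n), key=lambda j: w[j] * v_bar[j])
    v = [floor[i] * v_bar[i] for i in range(n)]
    others = sum(floor[j] for j in range(n) if j != i_star)
    v[i_star] = (1.0 - others) * v_bar[i_star]
    return v
```

With the hypothetical inputs `w = [1, 1]`, `v_bar = [2.0, 1.0]`, `mu = [0.3, 0.3]` and `gamma = [0.2, 0.1]`, user 0 has the larger $w_j \bar{v}_j$ and the sketch returns payoffs close to `[1.4, 0.3]`, whose normalized values sum to 1.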

C. Construct The Deviation-Proof Policy 

Given the optimal payoff $\mathbf{v}^* \in \mathcal{B}_{\underline{\mu}}$, we can construct the deviation-proof policy that achieves
the payoff $\mathbf{v}^*$. According to Definition 1, any payoff $\mathbf{v}^* \in \mathcal{B}_{\underline{\mu}}$ should be achieved by alternating
among $N$ operating points: $\mathbf{u}(\tilde{\mathbf{p}}^1), \ldots, \mathbf{u}(\tilde{\mathbf{p}}^N)$. Hence, the deviation-proof policy $\sigma^*$ satisfies
$\sigma^*(h^t) \in \{\tilde{\mathbf{p}}^1, \ldots, \tilde{\mathbf{p}}^N\}$ for any $t \ge 0$ and for any public history $h^t \in Y^t$. Since only one SU
transmits in a time slot, the deviation-proof policy can also be regarded as a schedule in a
TDMA fashion. By judiciously deciding which SU can transmit in each time slot, each SU
$i$ receives a discounted expected average payoff $v^*_i$ and has no incentive to deviate from the
policy. The deviation-proof policy can be implemented by each SU in a distributed manner. The
algorithm run by SU $i$ is described in Table II.

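The intuition of why the algorithm in Table II works is as follows. At each time slot $t$, each
SU calculates the indices for all the SUs, $\alpha_i(t), \forall i \in \mathcal{N}$, where

$$\alpha_i(t) = \frac{v'_i(t) - \underline{\mu}_i}{1 - v'_i(t) + \sum_{j \ne i} \left(-\rho(y_0|\tilde{\mathbf{p}}^i)/b_{ij}\right)}, \quad \forall i \in \mathcal{N}.$$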

The index $\alpha_i(t)$ measures SU $i$'s "urgency" to transmit at time slot $t$. The SU $i^*$ with the largest
index $\alpha_{i^*}(t) = \max_j \alpha_j(t)$ will transmit at time slot $t$. When no distress signal is received (which
indicates no deviation), SU $i^*$'s index in the next time slot is very likely to be small, in order to
give the other SUs larger opportunities to transmit. However, when the distress signal is received
(which indicates deviation), the SUs calculate the indices in a different way, such that SU $i^*$ still has
a large index in the next time slot. Hence, an SU has no incentive to deviate, because a deviation
leads to a smaller opportunity to transmit in the future.

Theorem 2 ensures that if all the SUs run the algorithm in Table II locally, they will achieve
the optimal operating point $\mathbf{v}^*$, and will have no incentive to deviate.

Theorem 2: For any target payoff $\mathbf{v}^* \in \mathcal{B}_{\underline{\mu}}$, and any discount factor $\delta \ge \underline{\delta}$, the strategy
generated by each user running the algorithm in Table II is PPE and achieves $\mathbf{v}^*$.

Proof: We provide an outline of the proof here. Please refer to Appendix B for the complete
proof.

The key to the proof is to demonstrate that all the payoffs $\{v'_i(t) \cdot \bar{v}_i\}_{i \in \mathcal{N}}, \forall t \ge 0$, generated
in the algorithm in Table II are in the self-generating set (the set of Pareto optimal equilibrium
payoffs) $\mathcal{B}_{\underline{\mu}}$. ■
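To make the scheduling mechanism concrete, the following Python sketch simulates the index computation and the payoff updates for the case in which no SU deviates, so that distress signals occur only as false alarms. It is a sketch under assumptions, not the authors' implementation: the index is taken to be $(v'_i - \underline{\mu}_i)/(1 - v'_i + \sum_{j \ne i}(-\rho(y_0|\tilde{\mathbf{p}}^i)/b_{ij}))$, inactive users' normalized payoffs are updated as $v'_j/\delta + (1/\delta - 1)\rho(y_0|\tilde{\mathbf{p}}^{i^*})/(-b_{i^*j})$ without a distress signal and $v'_j/\delta - (1/\delta - 1)(1 - \rho(y_0|\tilde{\mathbf{p}}^{i^*}))/(-b_{i^*j})$ with one, and the active user's payoff is obtained from the requirement that the normalized payoffs sum to 1. All names and numbers are hypothetical.

```python
import random


def run_table_ii(v_target, mu, rho0, b, delta, steps, seed=0):
    """Simulate the scheduling rule without deviations (illustrative sketch).

    v_target : normalized target payoffs (must sum to 1)
    mu       : lower bounds on the normalized payoffs
    rho0[i]  : probability of the distress signal when SU i is active
    b[i][j]  : benefit from deviation b_ij (negative, diagonal unused)
    Returns the list of scheduled users and the final normalized payoffs.
    """
    rng = random.Random(seed)
    n = len(v_target)
    v = list(v_target)
    k = 1.0 / delta - 1.0
    # s[i] = sum over j != i of -rho0[i] / b_ij (appears in the index denominator)
    s = [sum(-rho0[i] / b[i][j] for j in range(n) if j != i) for i in range(n)]
    schedule = []
    for _ in range(steps):
        alpha = [(v[i] - mu[i]) / (1.0 - v[i] + s[i]) for i in range(n)]
        i_star = max(range(n), key=lambda i: alpha[i])  # most "urgent" user transmits
        schedule.append(i_star)
        distress = rng.random() < rho0[i_star]  # false alarm only, nobody deviates
        nxt = [0.0] * n
        for j in range(n):
            if j == i_star:
                continue
            if distress:
                nxt[j] = v[j] / delta - k * (1.0 - rho0[i_star]) / (-b[i_star][j])
            else:
                nxt[j] = v[j] / delta + k * rho0[i_star] / (-b[i_star][j])
        # Active user: payoffs stay on the Pareto boundary (normalized sum equals 1).
        nxt[i_star] = 1.0 - sum(nxt)
        v = nxt
    return schedule, v
```

In a symmetric two-user example (`rho0 = [0.1, 0.1]`, `b_ij = -3`, `mu = [0.3, 0.3]`, `delta = 0.8`, targets `[0.5, 0.5]`), the two users take turns in the first two slots and the normalized payoffs remain above the lower bounds throughout, which is the invariance property that the proof of Theorem 2 relies on.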

D. Implementation Issues 

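We discuss the implementation issues of our proposed design framework, which can be
implemented in three phases as illustrated in Fig. 4. In Phase I, the LSS exchanges some
information with the SUs following the procedure described in Table III. In Phase II, using
the information obtained in Phase I, the LSS quantifies the set of Pareto optimal equilibrium
payoffs, and solves the policy design problem for the optimal equilibrium payoff. Finally in

TABLE II
The algorithm run by user i.

Input: The normalized target payoffs $\{v^*_j/\bar{v}_j\}_{j \in \mathcal{N}}$ given by the LSS
Initialization: Set $t = 0$, $v'_j(0) = v^*_j/\bar{v}_j$ for all $j \in \mathcal{N}$.
repeat
    Calculates the index $\alpha_j(t) = \dfrac{v'_j(t) - \underline{\mu}_j}{1 - v'_j(t) + \sum_{k \ne j} \left(-\rho(y_0|\tilde{\mathbf{p}}^j)/b_{jk}\right)}$ for all $j \in \mathcal{N}$
    Finds the largest index $i^* = \arg\max_{j \in \mathcal{N}} \alpha_j(t)$
    if $i = i^*$ then
        Transmits at the power level $\tilde{p}^i_i$
    end if
    Updates $v'_j(t+1)$ for all $j \in \mathcal{N}$ as follows:
    if No Distress Signal Received At Time Slot $t$ ($y^t = y_1$) then
        $v'_{i^*}(t+1) = \frac{1}{\delta} \cdot v'_{i^*}(t) - \left(\frac{1}{\delta} - 1\right) \cdot \left(1 + \sum_{j \ne i^*} \frac{\rho(y_0|\tilde{\mathbf{p}}^{i^*})}{-b_{i^*j}}\right)$
        $v'_j(t+1) = \frac{1}{\delta} \cdot v'_j(t) + \left(\frac{1}{\delta} - 1\right) \cdot \frac{\rho(y_0|\tilde{\mathbf{p}}^{i^*})}{-b_{i^*j}}, \ \forall j \in \mathcal{N}, j \ne i^*$
    else
        $v'_{i^*}(t+1) = \frac{1}{\delta} \cdot v'_{i^*}(t) - \left(\frac{1}{\delta} - 1\right) \cdot \left(1 - \sum_{j \ne i^*} \frac{1 - \rho(y_0|\tilde{\mathbf{p}}^{i^*})}{-b_{i^*j}}\right)$
        $v'_j(t+1) = \frac{1}{\delta} \cdot v'_j(t) - \left(\frac{1}{\delta} - 1\right) \cdot \frac{1 - \rho(y_0|\tilde{\mathbf{p}}^{i^*})}{-b_{i^*j}}, \ \forall j \in \mathcal{N}, j \ne i^*$
    end if
until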



[Figure 4: block diagram of the three implementation phases. Phase I: information exchange between the LSS and the SUs (see Table III for a detailed description of the information exchange procedure, and Table IV for the amount of information exchanged). Phase II: the LSS determines the optimal equilibrium payoff; Step 1: quantify the achievable equilibrium payoffs (by Theorem 1); Step 2: determine the optimal equilibrium payoff to achieve. Phase III: decentralized implementation by the SUs (Theorem 2, Table II); the local solvers of SU 1 through SU N are initialized with the optimal payoff, while the LSS measures the interference temperature and issues the distress signal.]

Fig. 4. Illustration of the implementation.

TABLE III
The Information Exchange Phase.

Events                                                                     Information obtained
SUs choose $\{\tilde{\mathbf{p}}^i\}_{i \in \mathcal{N}}$                  LSS: $\{\rho(y_0|\tilde{\mathbf{p}}^i)\}_{i \in \mathcal{N}}$
SUs choose $(p_j, \tilde{\mathbf{p}}^i_{-j}), \forall j, p_j$              LSS: $\rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j}), \forall j, p_j$
LSS broadcasts                                                             SU $i$: $\rho(y_0|\tilde{\mathbf{p}}^i), \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})$
SUs broadcast                                                              LSS, SUs: $b_{ij}, \forall i, j \ne i$
SUs send to LSS                                                            LSS: $\{\bar{v}_i\}_{i \in \mathcal{N}}$

TABLE IV
Comparison of the total amount of information exchanged.

Scheme            The total amount of information exchanged
[4]-[7], [12]     $O(N)$ per iteration · # of iterations
[10], [13]        $O(N^2)$ per iteration · # of iterations
Proposed          $\sum_{i \in \mathcal{N}} \sum_{j \ne i} |\mathcal{P}_j| + N^2 + N$

Phase III, the LSS sends the optimal equilibrium payoff to the SUs, as an input to each SU's 
decentralized algorithm of constructing the optimal deviation-proof policy. 

1) Overhead of information exchange: We briefly comment on the overhead of the information
exchange in the proposed framework. First, the information exchange in Phase I is necessary
for the LSS to determine and for the SUs to achieve the optimal equilibrium payoff. A similar
information exchange phase is proposed in [15], [16], [22]-[24]. The information exchange phase
can be considered as a substitute for the convergence process needed by the algorithms in
[4]-[7], [10], [12], [13]. In the proposed policy, since the players implement the policy without any
information exchange in Phase III, the only information exchange happens in Phase I and at
the end of Phase II (when the LSS broadcasts the optimal equilibrium payoff). The information
exchange method in our framework is advantageous in that its duration and the amount of
information to exchange are predetermined. On the other hand, the amount of information to
exchange in [4]-[7], [10], [12], [13] is proportional to the convergence time of their algorithms, which
is generally unbounded. We summarize the overhead of information exchange (measured by
the number of real numbers or pilot signals transmitted) in the related works in Table IV.

2) Computational complexity: As we can see from Table II, the computational complexity
of each SU in constructing the optimal policy is very small. At each period $t$, each SU only
needs to compute $N$ indices $\{\alpha_j(t)\}_{j \in \mathcal{N}}$ and $N$ normalized payoffs $\{v'_j(t)\}_{j \in \mathcal{N}}$, all of which
can be calculated by analytical expressions. In addition, although the original definition of the
strategy requires each SU to memorize the entire history of measurement outcomes, in the actual
implementation, each SU only needs to know the current measurement outcome and memorize
$N$ normalized payoffs $\{v'_j(t)\}_{j \in \mathcal{N}}$.

V. Simulation Results 

In this section, we demonstrate the performance gain of our spectrum sharing policy over
existing policies, and validate our theoretical analysis through numerical results. Throughout
this section, we use the following system parameters by default unless we change some of them
explicitly. The noise powers at all the SUs' receivers are normalized as 0 dB. The maximum
transmit powers of all the SUs are 10 dB. For simplicity, we assume that the direct channel
gains have the same distribution $g_{ii} \sim \mathcal{CN}(0, 1), \forall i$, and the cross channel gains have the same
distribution $g_{ij} \sim \mathcal{CN}(0, \beta), \forall i \ne j$, where $\beta$ is defined as the cross interference level. The
channel gain from each SU to the LSS also satisfies $g_{i0} \sim \mathcal{CN}(0, 1), \forall i$. The IT limit set by the
PU is $\bar{I} = 10$ dB. The measurement error $\varepsilon$ is Gaussian distributed with zero mean and variance
0.1. The maximum false alarm probability is $f = 10\%$. The SUs' payoffs are their throughput
as in (1). The welfare function is the average payoff, i.e. $W = \frac{1}{N} \sum_{i=1}^{N} v_i$. The minimum payoff
guarantee is 10% of the maximum achievable payoff, i.e. $\gamma_i = 0.1 \cdot \bar{v}_i, \forall i$.

A. Performance Evaluation 

1) Comparison with policies with constant power levels: We first compare the performance
of the proposed policy with that of the optimal policy with constant power levels. The optimal
policy with constant power levels (or "the optimal stationary policy") is the solution to a
modified version of the design problem (8). First, we add an additional constraint that the power
profile is constant, namely $\sigma(h^t) = \mathbf{p}$ for all $t \ge 0$ and for all $h^t \in Y^t$. Second, we drop the
incentive constraint that $\sigma$ is PPE from (8). Hence, the performance of the optimal stationary
policy is the best that can be achieved by existing stationary policies [4]-[11], and is an upper
bound for the deviation-proof stationary policies [12]-[14].

[Figure 5: four panels, one for each of N = 2, 3, 4, 5, plotting the average throughput of the proposed policy and of the stationary policy against the cross interference level.]

Fig. 5. Performance comparison of the proposed policy and the optimal policy with constant power levels ('stationary' in the
legend) under different numbers of users and different cross interference levels. A zero average throughput indicates that there
exists no feasible policy that satisfies all the constraints in the policy design problem.



In Fig. 5, we compare the performance of the proposed policy and that of the optimal stationary
policy under different cross interference levels and different numbers of SUs. As expected, the
proposed policy outperforms the optimal stationary policy in medium to high cross interference
levels (approximately when $\beta > 1$). In the cases of high cross interference levels ($\beta > 2$) and
many users ($N = 5$), the stationary policy fails to meet the minimum payoff guarantees due to
strong interference (indicated by zero average throughput in the figure). On the other hand, the
desirable feature of the proposed policy is that the average throughput does not decrease with
the increase of the cross interference level, because SUs transmit in a TDMA fashion. For the
same reason, the average throughput does not change with the number of SUs.

Note that the proposed policy is infeasible (zero average throughput) when the cross
interference level is very small. This is because it cannot be deviation-proof in this scenario. When the
interference level is very small, SU $j$ can deviate from $\tilde{\mathbf{p}}^i$ and receive a high reward $u_j(p_j, \tilde{\mathbf{p}}^i_{-j})$
because the interference from SU $i$, $\tilde{p}^i_i g_{ij}$, is small. Hence, the benefit of deviation $b_{ij}$ is large,
and the deviation is inevitable. This observation leads to an efficient way for the LSS to check the
cross interference level without knowing the channel gains. If the proposed policy is infeasible,
the LSS knows that the cross interference level is low, and can switch to stationary policies.

[Figure 6: performance of the proposed policy and of the optimal punish-forgive policy plotted against the variance of the measurement error (horizontal axis from 0.1 to 0.8), each shown for false alarm probabilities of 10%, 30%, and 50%.]

Fig. 6. Performance comparison of the proposed policy and the punish-forgive policy with the optimal punishment length
under different error variances and different false alarm probabilities.



2) Comparison with "punish-forgive" policies proposed under perfect monitoring: We also
compare the proposed policy with existing policies designed under the assumption of perfect
monitoring [15]-[18]. Specifically, we consider the "punish-forgive" policy in [15]-[18], which
requires SUs to switch to the punishment phase of $L$ time slots once a deviation is detected.
In the punishment phase, all the SUs transmit at the maximum power levels to create high
interference to the deviator.^6 A special case of the punish-forgive policy when the punishment
length $L = \infty$ [15] is the celebrated "grim-trigger" strategy in the game theory literature [25].
As discussed before, the punish-forgive policy works well if the SUs can perfectly monitor the
individual power levels of all the SUs, because in this case, the punishment serves as a threat and
will never be carried out in the equilibrium. However, when the SUs have imperfect monitoring
ability, the punishment will be carried out with some positive probability, which decreases all
the SUs' average payoffs.
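To give a sense of how often a punishment is triggered by mistake, the following small Python sketch computes the probability that at least one false alarm occurs within a given number of time slots. It assumes, purely for illustration, that no SU deviates and that false alarms are independent across time slots with the same probability; the function name is hypothetical.

```python
def prob_punishment_triggered(false_alarm_prob, horizon):
    """Probability of at least one false alarm within `horizon` time slots.

    Assumes no actual deviation and independent false alarms in each slot.
    """
    return 1.0 - (1.0 - false_alarm_prob) ** horizon
```

Under these assumptions, a false alarm probability of 10% already triggers a punishment phase within 10 time slots with probability of about 0.65, which illustrates why punishments designed for perfect monitoring become costly under imperfect monitoring.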

Fig. 6 shows that the proposed policy outperforms the punish-forgive policies under different
variances of measurement errors and different false alarm probabilities. For each combination of
the error variance and the false alarm probability, we choose the punish-forgive policy with the
optimal punishment length. The performance of punish-forgive policies degrades with the increase
of the error variance and the false alarm probability, because of the increasing probability of
mistakenly triggered punishments. How the performance of the proposed policy changes with the
error variance and the false alarm probability is explained in detail in the following subsections.

6 In the punishment phase of the punish-forgive policy [15]-[18], all the SUs transmit at the maximum power levels, and we
allow the violation of the IT constraint. Note that the IT constraint is never violated in the proposed policy.

Fig. 7. The impact of the variance of the measurement error on the performance of the proposed policy and on the minimum
discount factor under which the proposed policy is deviation-proof.

B. Impacts of Variances of Measurement Errors 

Fig. 7 shows that with the increase of the variance of measurement errors, the average
throughput decreases, and the SUs' patience (the discount factor) required to achieve Pareto
optimal equilibrium payoffs increases. First, when the error variance increases, the intermediate
IT limit $I$ must decrease to maintain the constraint on the false alarm probability. The decrease
of $I$ leads to the decrease of the SUs' maximum transmit power levels allowed, which results in
the decrease of the average throughput. Another impact of the increase in the error variance is
that $\rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j}) = \int_{x \ge \bar{I} - \tilde{p}^i_i g_{i0} - p_j g_{j0}} f_\varepsilon(x) dx$ increases, which leads to the increase of the benefit of
deviation $b_{ij}$. Hence, the minimum discount factor $\underline{\delta}$ increases according to Theorem 1.

Fig. 8. The impact of the false alarm probability on the performance of the proposed policy and on the minimum discount factor
under which the proposed policy is deviation-proof.



C. Impacts of Constraints on The False Alarm Probability 

Fig. 8 shows that with the increase of the false alarm probability limit $f$, both the average
throughput and the users' patience (the discount factor) required to achieve Pareto optimal
equilibrium payoffs increase. First, with an increased false alarm probability limit, the intermediate
IT limit $I$ can increase, which leads to an increase of the SUs' maximum transmit power levels
and thus an increase of the users' throughput. Meanwhile, since the magnitude of

$$\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j}) = -\int_{\bar{I} - I - g_{j0} p_j}^{\bar{I} - I} f_\varepsilon(x) dx$$

increases when $I$ increases, the benefit of deviation increases. This leads to an increase of
the minimum discount factor.
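The dependence of the intermediate IT limit on the error variance and on the false alarm probability limit can be illustrated with a short Python sketch. It rests on assumptions that are not spelled out in this section: the distress signal is taken to be triggered when the measured interference (true interference plus zero-mean Gaussian error) exceeds the IT limit, all quantities are in linear units, and the intermediate limit is chosen as the largest interference level whose false alarm probability equals the given limit. The function name is hypothetical.

```python
from statistics import NormalDist


def intermediate_it_limit(it_limit, error_variance, max_false_alarm):
    """Largest interference level whose false alarm probability meets the limit.

    Assumes a distress signal whenever interference + error > it_limit, with a
    zero-mean Gaussian error of the given variance (illustrative model only).
    """
    sigma = error_variance ** 0.5
    # Margin m such that P(error > m) equals the false alarm probability limit.
    margin = NormalDist(0.0, sigma).inv_cdf(1.0 - max_false_alarm)
    return it_limit - margin
```

Under this model the intermediate limit shrinks when the error variance grows and rises when a larger false alarm probability is tolerated, in line with the trends described for Fig. 7 and Fig. 8.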

This observation indicates an interesting design tradeoff. On one hand, a smaller false alarm
probability can reduce the overhead of sending distress signals, and can also relax the requirement
on SUs' patience. On the other hand, a larger false alarm probability can increase the average
throughput, such that the spectrum efficiency or the revenue can increase. Our theoretical results
characterize such a tradeoff, which can be used to choose the optimal intermediate IT limit $I$.

VI. Conclusion

In this paper, we studied power control in dynamic spectrum sharing among SUs under the
interference temperature constraint, and proposed a dynamic spectrum sharing policy that allows
SUs to transmit in a TDMA fashion. The proposed policy can achieve Pareto optimal operating
points that are not achievable under existing spectrum sharing policies with constant power
levels. The proposed policy is amenable to distributed implementation and is deviation-proof, in
that it is in the SUs' self-interest (i.e. maximizing their own QoS) to follow the policy. The
proposed policy can achieve Pareto optimality even when the SUs have limited and imperfect
monitoring ability: they only observe distress signals that erroneously indicate the violation of
the interference temperature constraint. Simulation results validate our analytical results on the
policy design and demonstrate the performance gains enabled by the proposed policy.

Appendix A
Proof of Theorem 1

The proof culminates in the demonstration that under certain conditions, a set of Pareto optimal
payoffs can be a self-generating set. Then according to [25, Proposition 7.3.1], [26], all the payoffs
in the set are equilibrium payoffs. More specifically, we derive the sufficient and necessary
conditions (i.e. Conditions 1 and 2 and the discount factor threshold in Theorem 1, the latter
referred to as Condition 3 in this appendix) under which a subset of Pareto optimal payoffs
is a self-generating set, and find the largest subset of Pareto optimal payoffs that can be
self-generating (i.e. $\mathcal{B}_{\underline{\mu}}$ defined in Theorem 1).

A. Preliminaries on Self-generating Sets

We first provide some background knowledge related to the self-generating sets. Similar to
Markov decision processes (MDPs), when we analyze the game, we can decompose the average
payoff into the current payoff and the continuation payoff (i.e. the average payoff starting from
the next time slot). However, there are two key differences between the decomposition in a game
and that in an MDP. First, there are multiple users in a game, as opposed to MDPs in which there
is usually only one user. Second, the incentive compatibility constraints, which are not present
in an MDP, need to be considered in a game. Hence, the decomposability in a game is defined
as follows [25, Definition 7.3.2], [26].^7

7 For the ease of reference, we duplicate the definition in [25, Definition 7.3.2] here.

Definition 4 (Decomposability): A payoff $\mathbf{v} \in \mathbb{R}^N$ is decomposable on a set $W \subset \mathbb{R}^N$ with
respect to discount factor $\delta$ and (pure) action profile $\mathbf{p}$, if there exists a mapping $\boldsymbol{\gamma} : Y \to W$,
such that for all $i \in \mathcal{N}$, we have

$$v_i = (1 - \delta) \cdot u_i(\mathbf{p}) + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|\mathbf{p}) \qquad (14)$$
$$\phantom{v_i} \ge (1 - \delta) \cdot u_i(p'_i, \mathbf{p}_{-i}) + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|p'_i, \mathbf{p}_{-i}), \quad \forall p'_i \in \mathcal{P}_i. \qquad (15)$$

A payoff $\mathbf{v}$ is decomposable on a set $W$ with respect to discount factor $\delta$, if there exists an
action profile $\mathbf{p}$, such that $\mathbf{v}$ is decomposable on the set $W$ with respect to discount factor $\delta$ and
action profile $\mathbf{p}$.

In the above definition, we can see that each user $i$'s payoff $v_i$ is decomposed into the
current payoff $u_i(\mathbf{p})$ and the expected continuation payoff $\sum_{y \in Y} \gamma_i(y) \rho(y|\mathbf{p})$, where $\boldsymbol{\gamma}$ specifies
the continuation payoff $\gamma_i(y)$ starting from the next period given the signal $y$. Importantly, the
decomposition needs to be incentive compatible, in the sense that each user $i$ cannot choose a
different action $p'_i$ to improve the average payoff. For convenience, we write $\mathscr{B}(W; \delta, \mathbf{p})$ as the
set of payoffs that can be decomposed on set $W$ with respect to discount factor $\delta$ and action
profile $\mathbf{p}$, namely

$$\mathscr{B}(W; \delta, \mathbf{p}) = \{\mathbf{v} \in \mathbb{R}^N : \mathbf{v} \text{ is decomposable on set } W \text{ with respect to } \delta \text{ and } \mathbf{p}\}. \qquad (16)$$

Similarly, we write $\mathscr{B}(W; \delta) = \cup_{\mathbf{p} \in \mathcal{P}} \mathscr{B}(W; \delta, \mathbf{p})$ as the set of payoffs that can be decomposed
on set $W$ with respect to discount factor $\delta$.

A self-generating set is a set $W$, in which every payoff $\mathbf{v} \in W$ is decomposable on the set
$W$ itself. The formal definition is as follows [25, Definition 7.3.4], [26].

Definition 5 (Self-generating Sets): A set $W$ is self-generating under discount factor $\delta$, if
$W \subseteq \mathscr{B}(W; \delta)$.

The self-generating sets play an important role in repeated game theory, because every payoff
in a self-generating set is an equilibrium payoff. We restate this important result formally in the
following lemma [25, Proposition 7.3.1], [26].

Lemma 1 (Self-generation): For any bounded set $W \subset \mathbb{R}^N$, if $W$ is self-generating, then
every payoff in $W$ is an equilibrium payoff of the repeated game.

B. Outline of The Proof

In the above subsection, we have summarized some important results related to self-generation
in repeated game theory. Now we outline the proof of Theorem 1.

Recall that due to Definition 1, the Pareto boundary of the considered repeated game is

$$\mathcal{B} = \left\{ \mathbf{v} : \sum_{i \in \mathcal{N}} \frac{v_i}{\bar{v}_i} = 1,\ v_i \ge 0,\ \forall i \in \mathcal{N} \right\}.$$

Consider a subset of the Pareto boundary

$$\mathcal{B}_{\mu} = \left\{ \mathbf{v} : \sum_{i \in \mathcal{N}} \frac{v_i}{\bar{v}_i} = 1,\ \frac{v_i}{\bar{v}_i} \ge \mu_i,\ \forall i \in \mathcal{N} \right\}, \qquad (17)$$

where $\mu_i \ge 0$ for all $i \in \mathcal{N}$. Our focus is to show that under certain conditions, the subset of the
Pareto boundary $\mathcal{B}_{\mu}$ can be a self-generating set, which means that every Pareto optimal payoff in
$\mathcal{B}_{\mu}$ can be an equilibrium payoff. In the next subsection, we derive the necessary conditions for $\mathcal{B}_{\mu}$
to be self-generating. These necessary conditions lead to Conditions 1-3 in Theorem 1. A byproduct
of the first necessary condition is a set of constraints on the boundary $\mu$ of the self-generating sets
$\mathcal{B}_{\mu}$ (i.e. the lower bound $\underline{\mu}$ of $\mu$ in Theorem 1), which leads to the characterization of the largest
possible self-generating set $\mathcal{B}_{\underline{\mu}}$. In the final subsection, we show that these necessary conditions
are also sufficient for $\mathcal{B}_{\mu}$ to be self-generating.



C. Necessary Conditions For a Set of Pareto Optimal Payoffs To Be Self-generating 

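Suppose that $\mathcal{B}_{\mu}$ is self-generating. Then for any payoff $\mathbf{v} \in \mathcal{B}_{\mu}$, there exists an action profile
$\mathbf{p}$ and a mapping $\boldsymbol{\gamma} : Y \to \mathcal{B}_{\mu}$, such that for all $i \in \mathcal{N}$, we have

$$v_i = (1 - \delta) \cdot u_i(\mathbf{p}) + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|\mathbf{p}) \qquad (18)$$
$$\phantom{v_i} \ge (1 - \delta) \cdot u_i(p'_i, \mathbf{p}_{-i}) + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|p'_i, \mathbf{p}_{-i}), \quad \forall p'_i \in \mathcal{P}_i. \qquad (19)$$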

The first observation is that the action profile $\mathbf{p}$ that decomposes a Pareto optimal payoff $\mathbf{v} \in \mathcal{B}_{\mu}$
must be a payoff-maximizing action profile for a certain user. In other words, $\mathbf{p} \in \{\tilde{\mathbf{p}}^1, \ldots, \tilde{\mathbf{p}}^N\}$.
This is because the average payoff $\mathbf{v}$ and the continuation payoffs $\boldsymbol{\gamma}(y), \forall y \in Y$, are all on the
Pareto boundary $\mathcal{B}$. In other words, $\sum_{i \in \mathcal{N}} v_i/\bar{v}_i = 1$ and $\sum_{i \in \mathcal{N}} \gamma_i(y)/\bar{v}_i = 1, \forall y \in Y$. Since the
average payoff is the convex combination of the current payoff and the expected continuation
payoff, the current payoff must also lie on the Pareto boundary, i.e. $\sum_{i \in \mathcal{N}} u_i(\mathbf{p})/\bar{v}_i = 1$.
According to Definition 1, the only action profiles that lie on the Pareto boundary are $\tilde{\mathbf{p}}^1, \ldots, \tilde{\mathbf{p}}^N$.

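Based on the above observation, we have $\mathscr{B}(W; \delta) = \cup_{i \in \mathcal{N}} \mathscr{B}(W; \delta, \tilde{\mathbf{p}}^i)$. Suppose that a payoff
$\mathbf{v} \in \mathcal{B}_{\mu}$ is decomposed by $\tilde{\mathbf{p}}^i$, namely $\mathbf{v} \in \mathscr{B}(W; \delta, \tilde{\mathbf{p}}^i)$. Using the facts that $u_i(\tilde{\mathbf{p}}^i) = \bar{v}_i$ and
$u_j(\tilde{\mathbf{p}}^i) = 0, \forall j \ne i$, we have

$$v_i = (1 - \delta) \cdot \bar{v}_i + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|\tilde{\mathbf{p}}^i) \qquad (20)$$
$$\phantom{v_i} \ge (1 - \delta) \cdot u_i(p_i, \tilde{\mathbf{p}}^i_{-i}) + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|p_i, \tilde{\mathbf{p}}^i_{-i}), \quad \forall p_i \in \mathcal{P}_i,$$

and for all $j \ne i$,

$$v_j = \delta \cdot \sum_{y \in Y} \gamma_j(y) \rho(y|\tilde{\mathbf{p}}^i) \qquad (21)$$
$$\phantom{v_j} \ge (1 - \delta) \cdot u_j(p_j, \tilde{\mathbf{p}}^i_{-j}) + \delta \cdot \sum_{y \in Y} \gamma_j(y) \rho(y|p_j, \tilde{\mathbf{p}}^i_{-j}), \quad \forall p_j \in \mathcal{P}_j.$$

Since user $j \ne i$ chooses $\tilde{p}^i_j = 0$ in action profile $\tilde{\mathbf{p}}^i$, we say that under action profile $\tilde{\mathbf{p}}^i$, user $i$
is the active user and user $j \ne i$ is an inactive user.

Next, we show that the incentive compatibility constraints for inactive users and the active
user imply Condition 1 and Condition 2 of Theorem 1, respectively. The incentive constraints
for inactive users also give us constraints on the boundary $\mu$ of $\mathcal{B}_{\mu}$. In addition, to make sure
that $\boldsymbol{\gamma}(y) \in \mathcal{B}_{\mu}, \forall y$, the discount factor should satisfy Condition 3 of Theorem 1.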

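1) Incentive Constraints For Inactive Users: We examine the incentive compatibility
constraint for an inactive user $j \ne i$ in (21), which will lead to the first necessary condition. First,
since $u_j(p_j, \tilde{\mathbf{p}}^i_{-j}) > 0, \forall p_j > 0$, for the inequality in (21) to hold, we must have
$\sum_{y \in Y} \gamma_j(y) \rho(y|\tilde{\mathbf{p}}^i) \ge \sum_{y \in Y} \gamma_j(y) \rho(y|p_j, \tilde{\mathbf{p}}^i_{-j})$, which is equivalent to

$$\left[\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})\right] \cdot \left(\gamma_j(y_0) - \gamma_j(y_1)\right) \ge 0, \quad \forall p_j > 0. \qquad (22)$$

Note that the probability of receiving distress signals given action profile $(p_j, \tilde{\mathbf{p}}^i_{-j})$ is no smaller
than the probability given $\tilde{\mathbf{p}}^i$, because

$$\rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j}) - \rho(y_0|\tilde{\mathbf{p}}^i) = \int_{\bar{I} - \tilde{p}^i_i g_{i0} - p_j g_{j0}}^{\bar{I} - \tilde{p}^i_i g_{i0}} f_\varepsilon(x) dx \ge 0. \qquad (23)$$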

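Since $\rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j}) \ge \rho(y_0|\tilde{\mathbf{p}}^i)$, we must have $\gamma_j(y_1) \ge \gamma_j(y_0)$. This requirement is intuitive:
we should set a lower continuation payoff following the distress signal $y_0$ in order to deter user
$j \ne i$ from deviating from $\tilde{\mathbf{p}}^i$.

From the equality constraint in (21), we have

$$\delta = \frac{v_j}{\sum_{y \in Y} \gamma_j(y) \rho(y|\tilde{\mathbf{p}}^i)}. \qquad (24)$$

Plugging in the above expression of $\delta$, we can eliminate discount factor $\delta$ in the inequality of
(21) and obtain an equivalent inequality as follows

$$\sum_{y \in Y} \gamma_j(y) \left[ \left(1 - \frac{v_j}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})}\right) \rho(y|\tilde{\mathbf{p}}^i) + \frac{v_j}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})} \rho(y|p_j, \tilde{\mathbf{p}}^i_{-j}) \right] \le v_j. \qquad (25)$$

For notational simplicity, we write the coefficient of $\gamma_j(y_1)$ in the above inequality as

$$c_{ij}(p_j, \tilde{\mathbf{p}}^i_{-j}) = \left(1 - \frac{v_j}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})}\right) \rho(y_1|\tilde{\mathbf{p}}^i) + \frac{v_j}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})} \rho(y_1|p_j, \tilde{\mathbf{p}}^i_{-j}) \qquad (26)$$
$$= \rho(y_1|\tilde{\mathbf{p}}^i) + v_j \cdot \frac{\rho(y_1|p_j, \tilde{\mathbf{p}}^i_{-j}) - \rho(y_1|\tilde{\mathbf{p}}^i)}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})} \qquad (27)$$
$$= \rho(y_1|\tilde{\mathbf{p}}^i) + v_j \cdot \frac{\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})}, \qquad (28)$$

and define the maximum value of the coefficient

$$c^+_{ij} = \max_{p_j \in \mathcal{P}_j, p_j \ne \tilde{p}^i_j} c_{ij}(p_j, \tilde{\mathbf{p}}^i_{-j}) \qquad (29)$$
$$= \rho(y_1|\tilde{\mathbf{p}}^i) + v_j \cdot \max_{p_j \in \mathcal{P}_j, p_j \ne \tilde{p}^i_j} \frac{\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})}. \qquad (30)$$

Since $\gamma_j(y_1) \ge \gamma_j(y_0)$, the set of inequality constraints in (21)

$$c_{ij}(p_j, \tilde{\mathbf{p}}^i_{-j}) \cdot \gamma_j(y_1) + \left(1 - c_{ij}(p_j, \tilde{\mathbf{p}}^i_{-j})\right) \cdot \gamma_j(y_0) \le v_j, \qquad (31)$$

for all $p_j > 0$, is equivalent to a single constraint

$$c^+_{ij} \cdot \gamma_j(y_1) + (1 - c^+_{ij}) \cdot \gamma_j(y_0) \le v_j. \qquad (32)$$

Hence, the incentive constraints (21) for user $j \ne i$ can be rewritten as

$$\rho(y_1|\tilde{\mathbf{p}}^i) \cdot \gamma_j(y_1) + \left(1 - \rho(y_1|\tilde{\mathbf{p}}^i)\right) \cdot \gamma_j(y_0) = \frac{v_j}{\delta},$$
$$c^+_{ij} \cdot \gamma_j(y_1) + (1 - c^+_{ij}) \cdot \gamma_j(y_0) \le v_j, \qquad (33)$$

where $\mu_j \cdot \bar{v}_j \le \gamma_j(y) \le \bar{v}_j, \forall y \in Y$.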

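The first necessary condition of $\mathcal{B}_{\mu} \subseteq \mathscr{B}(\mathcal{B}_{\mu}; \delta)$ is $c^+_{ij} \le 0$, as stated in the following
proposition.

Proposition 1: If $\mathcal{B}_{\mu} \subseteq \mathscr{B}(\mathcal{B}_{\mu}; \delta)$, then $c^+_{ij} \le 0$ for all $i \in \mathcal{N}$ and for all $j \ne i$.

Proof: If $\mathcal{B}_{\mu} \subseteq \mathscr{B}(\mathcal{B}_{\mu}; \delta)$, then any payoff $\mathbf{v}$ in $\mathcal{B}_{\mu}$ should satisfy $\mathbf{v} \in \mathscr{B}(\mathcal{B}_{\mu}; \delta)$. Pick a
payoff $\mathbf{v}^i$, in which

$$v^i_j = \begin{cases} \mu_j \cdot \bar{v}_j, & j \ne i, \\ \left(1 - \sum_{k \ne i} \mu_k\right) \cdot \bar{v}_i, & j = i. \end{cases} \qquad (34)$$

Note that $\mathbf{v}^i$ is the payoff profile in which every user $j \ne i$ has the smallest payoff $\mu_j \cdot \bar{v}_j$ and
user $i$ has the largest payoff $\left(1 - \sum_{k \ne i} \mu_k\right) \cdot \bar{v}_i$. We show that $\mathbf{v}^i \in \mathscr{B}(\mathcal{B}_{\mu}; \delta)$ implies $c^+_{ij} \le 0$
for all $j \ne i$.

First, $\mathbf{v}^i$ can only be decomposed by $\tilde{\mathbf{p}}^i$. Otherwise, suppose that $\mathbf{v}^i$ is decomposed by $\tilde{\mathbf{p}}^j, j \ne i$.
Then the decomposition of user $i$'s payoff is

$$v^i_i = \delta \cdot \left( \rho(y_1|\tilde{\mathbf{p}}^j) \cdot \gamma_i(y_1) + \left(1 - \rho(y_1|\tilde{\mathbf{p}}^j)\right) \cdot \gamma_i(y_0) \right). \qquad (35)$$

Since the convex combination of $\gamma_i(y_1)$ and $\gamma_i(y_0)$ is equal to $v^i_i/\delta$, which is strictly larger than
$v^i_i$, at least one of $\gamma_i(y_1)$ and $\gamma_i(y_0)$ is strictly larger than $v^i_i$. However, $\boldsymbol{\gamma}(y) \in \mathcal{B}_{\mu}$ implies that
$\gamma_i(y) \le v^i_i, \forall y \in Y$, which leads to a contradiction. Hence, $\mathbf{v}^i$ can only be decomposed by $\tilde{\mathbf{p}}^i$.

Now that $\mathbf{v}^i$ is decomposed by $\tilde{\mathbf{p}}^i$, we focus on the incentive constraints for an arbitrary user
$j \ne i$ in (33). From the equality in (33) and the requirement that $\gamma_j(y_1) \ge \gamma_j(y_0)$, we have
$\gamma_j(y_1) \ge v^i_j/\delta > v^i_j$. Then suppose that $c^+_{ij} > 0$; in order to satisfy the inequality in (33), we
must have $\gamma_j(y_0) < v^i_j$, which contradicts the fact that $\boldsymbol{\gamma}(y_0) \in \mathcal{B}_{\mu}$. Hence, we must have
$c^+_{ij} \le 0$ for all $j \ne i$.

Since the above argument for $\mathbf{v}^i$ applies to any $i \in \mathcal{N}$, we have $c^+_{ij} \le 0$ for all $i \in \mathcal{N}$ and for
all $j \ne i$. ■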

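The first necessary condition that $c^+_{ij} \le 0$ has two implications. First, since $\rho(y_1|\tilde{\mathbf{p}}^i)$ and $v_j$
are both nonnegative, we have

$$\max_{p_j \in \mathcal{P}_j, p_j \ne \tilde{p}^i_j} \frac{\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})} \le 0, \qquad (36)$$

which leads to Condition 1 in Theorem 1 that the benefit from deviation $b_{ij} < 0$.

Second, to decompose $\mathbf{v}^i$, we have

$$c^+_{ij} = \rho(y_1|\tilde{\mathbf{p}}^i) + v^i_j \cdot \max_{p_j \in \mathcal{P}_j, p_j \ne \tilde{p}^i_j} \frac{\rho(y_0|\tilde{\mathbf{p}}^i) - \rho(y_0|p_j, \tilde{\mathbf{p}}^i_{-j})}{u_j(p_j, \tilde{\mathbf{p}}^i_{-j})} \qquad (37)$$
$$= \rho(y_1|\tilde{\mathbf{p}}^i) + v^i_j \cdot \frac{b_{ij}}{\bar{v}_j} \qquad (38)$$
$$= \rho(y_1|\tilde{\mathbf{p}}^i) + \mu_j \cdot b_{ij} \qquad (39)$$
$$\le 0, \qquad (40)$$

which gives us a lower bound on $\mu_j$, namely

$$\mu_j \ge \frac{\rho(y_1|\tilde{\mathbf{p}}^i)}{-b_{ij}} = \frac{1 - \rho(y_0|\tilde{\mathbf{p}}^i)}{-b_{ij}}. \qquad (41)$$

Since $\mathbf{v}^i$ should be decomposed for all $i \in \mathcal{N}$, we have

$$\mu_j \ge \max_{i \ne j} \frac{1 - \rho(y_0|\tilde{\mathbf{p}}^i)}{-b_{ij}}, \qquad (42)$$

which leads to the lower bound $\underline{\mu}_j$ in Theorem 1.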

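2) Incentive Constraints For The Active User: We examine the incentive constraints for the
active user $i$ in (20), which will lead to the second necessary condition (i.e. Condition 2 in
Theorem 1).

Suppose that a payoff $\mathbf{v} \in \mathcal{B}_{\mu}$ is decomposed by $\tilde{\mathbf{p}}^i$. We rewrite the incentive constraint for
the active user $i$ here

$$v_i = (1 - \delta) \cdot \bar{v}_i + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|\tilde{\mathbf{p}}^i) \qquad (43)$$
$$\phantom{v_i} \ge (1 - \delta) \cdot u_i(p_i, \tilde{\mathbf{p}}^i_{-i}) + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|p_i, \tilde{\mathbf{p}}^i_{-i}), \quad \forall p_i \in \mathcal{P}_i.$$

Since $\boldsymbol{\gamma}(y) \in \mathcal{B}_{\mu}$, given the inactive users' continuation payoffs $\gamma_j(y)$, the active user's
continuation payoff is determined by $\gamma_i(y) = \bar{v}_i \left(1 - \sum_{j \ne i} \frac{\gamma_j(y)}{\bar{v}_j}\right)$.

First, it is not difficult to check that if $\{\gamma_j(y)\}_{j \ne i}, \forall y$, satisfy the inactive users' equality
constraints in (33), then $\gamma_i(y) = \bar{v}_i \left(1 - \sum_{j \ne i} \frac{\gamma_j(y)}{\bar{v}_j}\right)$ will satisfy the active user's equality
constraint in (43):

$$(1 - \delta) \cdot \bar{v}_i + \delta \cdot \sum_{y \in Y} \gamma_i(y) \rho(y|\tilde{\mathbf{p}}^i)$$
$$= (1 - \delta) \cdot \bar{v}_i + \delta \cdot \sum_{y \in Y} \bar{v}_i \left(1 - \sum_{j \ne i} \frac{\gamma_j(y)}{\bar{v}_j}\right) \rho(y|\tilde{\mathbf{p}}^i)$$
$$= (1 - \delta) \cdot \bar{v}_i + \delta \cdot \bar{v}_i \sum_{y \in Y} \rho(y|\tilde{\mathbf{p}}^i) - \delta \cdot \bar{v}_i \sum_{y \in Y} \sum_{j \ne i} \frac{\gamma_j(y) \rho(y|\tilde{\mathbf{p}}^i)}{\bar{v}_j}$$
$$= \bar{v}_i - \delta \cdot \bar{v}_i \sum_{j \ne i} \frac{1}{\bar{v}_j} \sum_{y \in Y} \gamma_j(y) \rho(y|\tilde{\mathbf{p}}^i)$$
$$= \bar{v}_i - \delta \cdot \bar{v}_i \sum_{j \ne i} \frac{v_j/\delta}{\bar{v}_j}$$
$$= v_i,$$

where the last step uses $\sum_{k \in \mathcal{N}} v_k/\bar{v}_k = 1$.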

The inequality constraint in (43) requires that the active user $i$ has no incentive to choose
another action $p_i \neq \tilde{p}^i_i$. Although the active user $i$'s current payoff is maximized at $\tilde{\mathbf{p}}^i$, it may
still have the incentive to deviate for the following reason. Since $\gamma_j(y_1) > \gamma_j(y_0)$ for all $j \neq i$,
we have $\gamma_i(y_1) < \gamma_i(y_0)$. In other words, the active user $i$ has a larger continuation payoff when
the distress signal $y_0$ is received. Hence, it may want to deviate, such that the probability of
receiving the distress signal is increased, if the increase of the expected continuation payoff
outweighs the decrease of the current payoff. To prevent the active user $i$ from deviating, we
should make its continuation payoffs $\gamma_i(y_1)$ and $\gamma_i(y_0)$ as close as possible. Equivalently, we
should make the inactive users' continuation payoffs $\gamma_j(y_1)$ and $\gamma_j(y_0)$ as close as possible.

For an inactive user $j \neq i$, the closest continuation payoffs that satisfy the incentive constraints
(33) are the ones that satisfy the inequality with equality. Hence, we can solve for the continuation
payoffs as



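$$
\gamma_j(y_1) = \frac{\frac{1}{\delta}(1-c_{ij}) - \left(1-\rho(y_1|\tilde{\mathbf{p}}^i)\right)}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot v_j,
\qquad
\gamma_j(y_0) = \frac{\rho(y_1|\tilde{\mathbf{p}}^i) - \frac{1}{\delta}\,c_{ij}}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot v_j,
\tag{44}
$$

where $c_{ij} = \rho(y_1|\tilde{\mathbf{p}}^i) + \frac{v_j}{\bar{v}_j}\cdot b_{ij}$ is the quantity in (39).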
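The following Python sketch checks numerically that continuation payoffs of the form in (44) keep the promised payoff of an inactive user (whose current payoff is zero) and make its incentive constraint bind. It assumes $c_{ij} = \rho(y_1|\tilde{\mathbf{p}}^i) + (v_j/\bar{v}_j)\, b_{ij}$; the function name and all numbers are hypothetical.

```python
def inactive_continuation(v_j, v_bar_j, rho_y1, b_ij, delta):
    """Continuation payoffs (gamma_j(y1), gamma_j(y0)) of an inactive user j.

    Assumes c_ij = rho(y1 | p~^i) + (v_j / v_bar_j) * b_ij with b_ij < 0,
    so that rho_y1 - c_ij > 0.
    """
    c_ij = rho_y1 + (v_j / v_bar_j) * b_ij
    denom = rho_y1 - c_ij
    gamma_y1 = ((1.0 - c_ij) / delta - (1.0 - rho_y1)) / denom * v_j
    gamma_y0 = (rho_y1 - c_ij / delta) / denom * v_j
    return gamma_y1, gamma_y0


# Hypothetical numbers for illustration only.
delta, rho_y1, b_ij = 0.9, 0.9, -5.0
v_j, v_bar_j = 0.3, 1.0
g1, g0 = inactive_continuation(v_j, v_bar_j, rho_y1, b_ij, delta)

# Promise keeping: the inactive user's current payoff is zero, so its payoff
# equals the discounted expected continuation payoff.
promised = delta * (rho_y1 * g1 + (1.0 - rho_y1) * g0)

# Binding incentive constraint: the gap between the two continuation payoffs
# equals (1 - delta) / delta * v_bar_j / (-b_ij).
gap = g1 - g0
```

In this example `promised` recovers `v_j`, and the no-distress signal $y_1$ rewards the inactive user (`g1 > g0`), which is what deters it from transmitting.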



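Given the inactive users' continuation payoffs, we can obtain the active user's continuation
payoffs $\gamma_i(y_1)$ and $\gamma_i(y_0)$. Plugging the expressions of $\gamma_i(y_1)$ and $\gamma_i(y_0)$ into the inequality in
(43), we have for all $p_i \neq \tilde{p}^i_i$,

$$
\begin{aligned}
& v_i \ge (1-\delta)\cdot u_i(p_i,\tilde{\mathbf{p}}^i_{-i}) + \delta\cdot\sum_{y\in Y}\gamma_i(y)\,\rho(y|p_i,\tilde{\mathbf{p}}^i_{-i})\\
\Leftrightarrow\ & (1-\delta)\left[\bar{v}_i - u_i(p_i,\tilde{\mathbf{p}}^i_{-i})\right] + \delta\left[\gamma_i(y_1)-\gamma_i(y_0)\right]\left[\rho(y_1|\tilde{\mathbf{p}}^i) - \rho(y_1|p_i,\tilde{\mathbf{p}}^i_{-i})\right] \ge 0\\
\Leftrightarrow\ & (1-\delta)\cdot\bar{v}_i - (1-\delta)\cdot u_i(p_i,\tilde{\mathbf{p}}^i_{-i}) + (1-\delta)\cdot\bar{v}_i\cdot\sum_{j\neq i}\frac{\rho(y_1|p_i,\tilde{\mathbf{p}}^i_{-i}) - \rho(y_1|\tilde{\mathbf{p}}^i)}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot\frac{v_j}{\bar{v}_j} \ge 0\\
\Leftrightarrow\ & \bar{v}_i - u_i(p_i,\tilde{\mathbf{p}}^i_{-i}) + \bar{v}_i\cdot\sum_{j\neq i}\frac{\rho(y_1|p_i,\tilde{\mathbf{p}}^i_{-i}) - \rho(y_1|\tilde{\mathbf{p}}^i)}{-b_{ij}} \ge 0,
\end{aligned}
$$

which leads to Condition 2 in Theorem 1.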
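A minimal sketch of how the active user's condition could be checked over a finite set of candidate deviations is given below. It assumes the condition takes the form $\bar{v}_i - u_i(p_i,\tilde{\mathbf{p}}^i_{-i}) + \bar{v}_i \sum_{j\neq i} [\rho(y_1|p_i,\tilde{\mathbf{p}}^i_{-i}) - \rho(y_1|\tilde{\mathbf{p}}^i)]/(-b_{ij}) \ge 0$; the function name, the representation of a deviation as a pair (current payoff, probability of $y_1$), and the numbers are all hypothetical.

```python
def active_user_complies(v_bar_i, rho_y1, b_row, deviations):
    """Return True if no listed deviation is profitable for the active user i.

    v_bar_i    : payoff of user i when it is the only active user.
    rho_y1     : rho(y1 | p~^i), probability of no distress signal on path.
    b_row      : assumed values b_ij (negative) for all j != i.
    deviations : iterable of (u_dev, rho_y1_dev) pairs, one per candidate
                 power level p_i; both entries are assumed to be known.
    """
    weight = sum(1.0 / (-b_ij) for b_ij in b_row)
    return all(
        v_bar_i - u_dev + v_bar_i * (rho_dev - rho_y1) * weight >= 0.0
        for u_dev, rho_dev in deviations
    )


# Hypothetical numbers for illustration only.
ok = active_user_complies(1.0, 0.9, [-5.0, -5.0], [(0.8, 0.95), (0.6, 0.99)])
bad = active_user_complies(1.0, 0.9, [-5.0, -5.0], [(1.0, 0.5)])
```

The second call fails because the deviation keeps the current payoff unchanged while sharply raising the probability of the distress signal, which is exactly the kind of deviation the condition rules out.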

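3) Constraints On The Discount Factor: Now we derive the necessary conditions on the
discount factor. The minimum discount factor $\underline{\delta}(\mu)$ required for $\mathcal{B}_\mu$ to be a self-generating set
can be solved by

$$
\underline{\delta}(\mu) = \max_{\mathbf{v}\in\mathcal{B}_\mu} \delta, \quad \text{subject to } \mathbf{v} \in \mathscr{B}(\mathcal{B}_\mu;\delta). \tag{45}
$$

Since $\mathscr{B}(\mathcal{B}_\mu;\delta) = \bigcup_{i\in\mathcal{N}} \mathscr{B}(\mathcal{B}_\mu;\delta,\tilde{\mathbf{p}}^i)$, the above optimization problem can be reformulated as

$$
\underline{\delta}(\mu) = \max_{\mathbf{v}\in\mathcal{B}_\mu}\min_{i\in\mathcal{N}} \delta, \quad \text{subject to } \mathbf{v} \in \mathscr{B}(\mathcal{B}_\mu;\delta,\tilde{\mathbf{p}}^i). \tag{46}
$$

To solve the optimization problem (46), we explicitly express the constraint $\mathbf{v} \in \mathscr{B}(\mathcal{B}_\mu;\delta,\tilde{\mathbf{p}}^i)$
using the results derived in the previous two subsections. The inactive users' continuation payoffs
have been derived in (44), which determine the active user's continuation payoffs. Hence, the
constraint $\mathbf{v} \in \mathscr{B}(\mathcal{B}_\mu;\delta,\tilde{\mathbf{p}}^i)$ on discount factor $\delta$ is equivalent to

$$
\boldsymbol{\gamma}(y) \in \mathcal{B}_\mu,\quad \forall y \in Y, \tag{47}
$$

which can be written explicitly as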



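$$
\begin{aligned}
\gamma_j(y_1) &= \frac{\frac{1}{\delta}(1-c_{ij}) - \left(1-\rho(y_1|\tilde{\mathbf{p}}^i)\right)}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot v_j \in [\mu_j\cdot\bar{v}_j,\ \bar{v}_j],\ \forall j \neq i, && (48)\\
\gamma_j(y_0) &= \frac{\rho(y_1|\tilde{\mathbf{p}}^i) - \frac{1}{\delta}\,c_{ij}}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot v_j \in [\mu_j\cdot\bar{v}_j,\ \bar{v}_j],\ \forall j \neq i, && (49)\\
\gamma_i(y_1) &= \bar{v}_i\left(1-\sum_{j\neq i}\frac{\gamma_j(y_1)}{\bar{v}_j}\right) \in [\mu_i\cdot\bar{v}_i,\ \bar{v}_i], && (50)\\
\gamma_i(y_0) &= \bar{v}_i\left(1-\sum_{j\neq i}\frac{\gamma_j(y_0)}{\bar{v}_j}\right) \in [\mu_i\cdot\bar{v}_i,\ \bar{v}_i]. && (51)
\end{aligned}
$$

Since $\gamma_j(y_1) > \gamma_j(y_0)$, the constraints on $\gamma_j(y_1)$ and $\gamma_j(y_0)$ can be simplified as

$$
\begin{aligned}
& \gamma_j(y_1) = \frac{\frac{1}{\delta}(1-c_{ij}) - \left(1-\rho(y_1|\tilde{\mathbf{p}}^i)\right)}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot v_j \le \bar{v}_j && (52)\\
\Leftrightarrow\ & \delta \ge \frac{1-c_{ij}}{\left(1-\rho(y_1|\tilde{\mathbf{p}}^i)\right) + \frac{\bar{v}_j}{v_j}\left(\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}\right)} && (53)
\end{aligned}
$$

and

$$
\gamma_j(y_0) = \frac{\rho(y_1|\tilde{\mathbf{p}}^i) - \frac{1}{\delta}\,c_{ij}}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot v_j \ge \mu_j\cdot\bar{v}_j. \tag{54}
$$

Note that the constraint (54) will be satisfied as long as $c_{ij} \le 0$.

Since $\gamma_i(y_1) < \gamma_i(y_0)$, the constraints on $\gamma_i(y_1)$ and $\gamma_i(y_0)$ can be simplified as

$$
\gamma_i(y_1) \ge \mu_i\cdot\bar{v}_i \ \Leftrightarrow\ \delta \ge \frac{1}{1 + \dfrac{v_i/\bar{v}_i - \mu_i}{\sum_{j\neq i}\frac{1-c_{ij}}{\rho(y_1|\tilde{\mathbf{p}}^i)-c_{ij}}\cdot\frac{v_j}{\bar{v}_j}}} \tag{55}
$$

and

$$
\gamma_i(y_0) \le \bar{v}_i. \tag{56}
$$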


Note that the above constraint on $\gamma_i(y_0)$ is satisfied as long as (54) is satisfied for all $j \neq i$.
Note also that the constraint (52) is satisfied as long as (55) is satisfied.
To sum up, the discount factor needs to satisfy the following constraint:

$$
\delta \ge \frac{1}{1 + \dfrac{v_i/\bar{v}_i - \mu_i}{\sum_{j\neq i}\frac{1-c_{ij}}{\rho(y_1|\tilde{\mathbf{p}}^i)-c_{ij}}\cdot\frac{v_j}{\bar{v}_j}}}. \tag{57}
$$

Hence, the optimization problem (46) is equivalent to

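$$
\underline{\delta}(\mu) = \max_{\mathbf{v}\in\mathcal{B}_\mu}\min_{i\in\mathcal{N}} x_i(\mathbf{v}), \tag{58}
$$

where

$$
x_i(\mathbf{v}) = \frac{1}{1 + \dfrac{v_i/\bar{v}_i - \mu_i}{1 - v_i/\bar{v}_i + \sum_{j\neq i}\frac{\rho(y_0|\tilde{\mathbf{p}}^i)}{-b_{ij}}}}.
$$

Since $x_i(\mathbf{v})$ is decreasing in $v_i$ and increasing in $v_j, \forall j \neq i$, the payoff $\mathbf{v}^*$ that maximizes
$\min_{i\in\mathcal{N}} x_i(\mathbf{v})$ must satisfy $x_i(\mathbf{v}^*) = x_j(\mathbf{v}^*)$ for all $i$ and $j$. Now we find the payoff $\mathbf{v}^*$ such that
$x_i(\mathbf{v}^*) = x_j(\mathbf{v}^*)$ for all $i$ and $j$.

Define $z = \dfrac{v_i/\bar{v}_i - \mu_i}{1 - v_i/\bar{v}_i + \sum_{j\neq i}\frac{\rho(y_0|\tilde{\mathbf{p}}^i)}{-b_{ij}}}, \forall i \in \mathcal{N}$. Then we can solve for $v_i/\bar{v}_i$ as follows:

$$
\frac{v_i}{\bar{v}_i} = \frac{\mu_i + z\left(1 + \sum_{j\neq i}\frac{\rho(y_0|\tilde{\mathbf{p}}^i)}{-b_{ij}}\right)}{1+z}. \tag{59}
$$

Since $\sum_{i\in\mathcal{N}} v_i/\bar{v}_i = 1$, we can solve for $z$ as

$$
z = \frac{1 - \sum_{i\in\mathcal{N}}\mu_i}{N - 1 + \sum_{i\in\mathcal{N}}\sum_{j\neq i}\frac{\rho(y_0|\tilde{\mathbf{p}}^i)}{-b_{ij}}}. \tag{60}
$$

Hence, the minimum discount factor is $\underline{\delta}(\mu) = \frac{1}{1+z}$, which leads to Condition 3 in Theorem 1.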
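The closed form above can be evaluated directly. The Python sketch below computes $z$, the minimum discount factor $1/(1+z)$, and the equalizing normalized payoffs $v_i^*/\bar{v}_i$, assuming $z = (1-\sum_i \mu_i)/(N-1+\sum_i\sum_{j\neq i}\rho(y_0|\tilde{\mathbf{p}}^i)/(-b_{ij}))$ and $x_i(\mathbf{v}) = 1/(1 + (v_i/\bar{v}_i-\mu_i)/(1 - v_i/\bar{v}_i + \sum_{j\neq i}\rho(y_0|\tilde{\mathbf{p}}^i)/(-b_{ij})))$. Function names, array layout, and numbers are hypothetical.

```python
def interference_terms(rho_y0, b):
    """s[i] = sum over j != i of rho(y0 | p~^i) / (-b_ij)."""
    n = len(rho_y0)
    return [sum(rho_y0[i] / (-b[i][j]) for j in range(n) if j != i)
            for i in range(n)]


def min_discount_factor(mu, rho_y0, b):
    """Return (delta_min, w_star), where w_star[i] = v_i* / v_bar_i."""
    n = len(mu)
    s = interference_terms(rho_y0, b)
    z = (1.0 - sum(mu)) / (n - 1 + sum(s))
    delta_min = 1.0 / (1.0 + z)
    w_star = [(mu[i] + z * (1.0 + s[i])) / (1.0 + z) for i in range(n)]
    return delta_min, w_star


def x(i, w, mu, s):
    """Discount-factor requirement x_i(v) at normalized payoffs w."""
    return 1.0 / (1.0 + (w[i] - mu[i]) / (1.0 - w[i] + s[i]))


# Hypothetical 3-user example (numbers chosen for illustration only).
mu = [0.18, 0.18, 0.18]
rho_y0 = [0.1, 0.2, 0.1]
b = [[0.0, -5.0, -5.0],
     [-5.0, 0.0, -5.0],
     [-5.0, -5.0, 0.0]]
delta_min, w_star = min_discount_factor(mu, rho_y0, b)
s = interference_terms(rho_y0, b)
requirements = [x(i, w_star, mu, s) for i in range(len(mu))]
```

At `w_star` every entry of `requirements` coincides with `delta_min`, which is the equalization property used to solve the max-min problem.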

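D. Necessary Conditions Are Also Sufficient

In the previous subsection, we have derived three necessary conditions for the set $\mathcal{B}_\mu$ to be
self-generating. Now we show that the three necessary conditions are also sufficient for $\mathcal{B}_\mu$ to
be self-generating.

Given any payoff $\mathbf{v} \in \mathcal{B}_\mu$, we can determine the action profile $\tilde{\mathbf{p}}^i$ that decomposes it and the
corresponding continuation payoffs based on the results in the previous subsection. First, the
action profile $\tilde{\mathbf{p}}^i$ that decomposes $\mathbf{v}$ is determined by

$$
i = \arg\min_{j\in\mathcal{N}} x_j(\mathbf{v}) = \arg\max_{j\in\mathcal{N}} \frac{v_j/\bar{v}_j - \mu_j}{1 - v_j/\bar{v}_j + \sum_{k\neq j}\frac{\rho(y_0|\tilde{\mathbf{p}}^j)}{-b_{jk}}}. \tag{61}
$$

Then we determine the continuation payoffs as

$$
\begin{aligned}
\gamma_j(y_1) &= \frac{\frac{1}{\delta}(1-c_{ij}) - \left(1-\rho(y_1|\tilde{\mathbf{p}}^i)\right)}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot v_j,\quad
\gamma_j(y_0) = \frac{\rho(y_1|\tilde{\mathbf{p}}^i) - \frac{1}{\delta}\,c_{ij}}{\rho(y_1|\tilde{\mathbf{p}}^i) - c_{ij}}\cdot v_j,\quad \forall j \neq i,\\
\gamma_i(y) &= \bar{v}_i\left(1-\sum_{j\neq i}\frac{\gamma_j(y)}{\bar{v}_j}\right),\quad \forall y \in Y.
\end{aligned}
\tag{62}
$$

Conditions 1 and 2 ensure that the incentive constraints for the active user (43) and the inactive
users (33) are satisfied by setting the continuation payoffs as above. Condition 3 on the discount
factor $\delta$ ensures that the above continuation payoff $\boldsymbol{\gamma}(y) \in \mathcal{B}_\mu$. Hence, any payoff $\mathbf{v} \in \mathcal{B}_\mu$ is
decomposable on set $\mathcal{B}_\mu$ with respect to discount factor $\delta \ge \underline{\delta}(\mu)$. Then $\mathcal{B}_\mu$ is self-generating,
and any payoff in $\mathcal{B}_\mu$ is an equilibrium payoff.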

Appendix B
Proof of Theorem 2

We have characterized the largest set of Pareto optimal equilibrium payoffs $\mathcal{B}_\mu$. In the
algorithm in Table II, we start with the target payoff $\mathbf{v}^* \in \mathcal{B}_\mu$ as the average payoff at period 0, and
decompose it into a current payoff and a continuation payoff. The decomposition tells us what
action profile to play in period 0. Then we decompose the continuation payoff and determine
the action profile to play in period 1. By performing the decomposition in every period, we can
determine what action profile to play given any signal at every period.

Specifically, suppose that the continuation payoff at period $t$ is $\mathbf{v}(t)$. Then the action profile
$\tilde{\mathbf{p}}^{i^*}$ to decompose $\mathbf{v}(t)$ is determined by

$$
i^* = \arg\min_{i\in\mathcal{N}} x_i(\mathbf{v}(t)) = \arg\max_{j\in\mathcal{N}} \frac{v_j(t)/\bar{v}_j - \mu_j}{1 - v_j(t)/\bar{v}_j + \sum_{k\neq j}\frac{\rho(y_0|\tilde{\mathbf{p}}^j)}{-b_{jk}}}, \tag{63}
$$

where $\dfrac{v_j(t)/\bar{v}_j - \mu_j}{1 - v_j(t)/\bar{v}_j + \sum_{k\neq j}\frac{\rho(y_0|\tilde{\mathbf{p}}^j)}{-b_{jk}}}$ is exactly user $j$'s index $\alpha_j(t)$. Then we can determine the
continuation payoff $\mathbf{v}(t+1)$ according to (62).
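One period of this decomposition can be sketched in Python as follows. The sketch works with normalized payoffs $w_j = v_j(t)/\bar{v}_j$ (an assumption made to keep the code short), selects the active user by the largest index $\alpha_j(t)$, and updates the continuation payoffs with the binding inactive-user constraint, written here in the equivalent form $\gamma_j(y)/\bar{v}_j = w_j/\delta \pm (1/\delta - 1)\,\rho(\cdot|\tilde{\mathbf{p}}^i)/(-b_{ij})$. The function name, signal labels, and numbers are hypothetical and are not the algorithm of Table II verbatim.

```python
def policy_step(w, mu, rho_y0, b, delta, signal):
    """One decomposition step on normalized payoffs w (entries sum to 1).

    signal is "y1" (no distress) or "y0" (distress).
    Returns (active_user, w_next).
    """
    n = len(w)
    s = [sum(rho_y0[j] / (-b[j][k]) for k in range(n) if k != j)
         for j in range(n)]
    # Index alpha_j(t); the user with the largest index transmits.
    alpha = [(w[j] - mu[j]) / (1.0 - w[j] + s[j]) for j in range(n)]
    i = max(range(n), key=lambda j: alpha[j])

    k = 1.0 / delta - 1.0
    w_next = [0.0] * n
    for j in range(n):
        if j == i:
            continue
        # Inactive users are rewarded after y1 and penalized after y0.
        shift = rho_y0[i] if signal == "y1" else -(1.0 - rho_y0[i])
        w_next[j] = w[j] / delta + k * shift / (-b[i][j])
    # The active user's continuation payoff keeps the normalized sum at 1.
    w_next[i] = 1.0 - sum(w_next[j] for j in range(n) if j != i)
    return i, w_next


# Hypothetical 3-user example (numbers chosen for illustration only).
w0 = [0.5, 0.3, 0.2]
mu = [0.18, 0.18, 0.18]
rho_y0 = [0.1, 0.2, 0.1]
b = [[0.0, -5.0, -5.0],
     [-5.0, 0.0, -5.0],
     [-5.0, -5.0, 0.0]]
delta = 0.9
active, w_y1 = policy_step(w0, mu, rho_y0, b, delta, "y1")
_, w_y0 = policy_step(w0, mu, rho_y0, b, delta, "y0")
```

Iterating `policy_step` with the observed signal in each period yields the sequence of active users, so each user only needs to track the vector of normalized continuation payoffs and the common signal.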